1000 Genomes

  • Article type: Journal Article
    SNP-based imputation approaches for human leukocyte antigen (HLA) typing take advantage of the haplotype structure within the major histocompatibility complex (MHC) region. These methods predict HLA classical alleles from dense SNP genotypes, which are commonly available on the array-based platforms used in genome-wide association studies (GWAS). HLA classical alleles can therefore be analysed in existing SNP datasets at no additional genotyping cost. Here, we describe the workflow of HIBAG, an imputation method with attribute bagging, to infer a sample's HLA classical alleles using SNP data. Two examples demonstrate the functionality using public HLA and SNP data from the latest release of the 1000 Genomes project: genotype imputation using pre-built classifiers in a GWAS, and model training to create a new prediction model. The GPU implementation facilitates model building, making it hundreds of times faster than the single-threaded implementation.






  • Article type: Journal Article
    Currently, the genetic variants strongly associated with risk for Multiple Sclerosis (MS) are located in the Major Histocompatibility Complex. These include the DRB1*15:01 and DRB1*15:03 alleles at the HLA-DRB1 locus, the latter restricted to African populations; the DQB1*06:02 allele at the HLA-DQB1 locus, which is in high linkage disequilibrium (LD) with DRB1*15:01; and the protective allele A*02:01 at the HLA-A locus. HLA allele identification is facilitated by co-inherited ('tag') single nucleotide polymorphisms (SNPs); however, SNP validation is not typically done outside of the discovery population. We examined 19 SNPs reported to be in high LD with these alleles in 2,502 healthy subjects included in the 1000 Genomes panel having typed HLA data. Examination of three indices (LD R² values, sensitivity and specificity, and minor allele frequency) revealed few SNPs with high tagging performance. All SNPs examined that tag DRB1*15:01 were in perfect LD in the British population; three showed high tagging performance in 4 of the 5 European and 2 of the 4 American populations. For DQB1*06:02, with no previously validated tag SNPs, we show that rs3135388 has high tagging performance in one South Asian, one American, and one European population. We identify for the first time that rs2844821 has high tagging performance for A*02:01 in 5 of 7 African populations, including African Americans, and 4 of the 5 European populations. These results provide a basis for selecting SNPs with high tagging performance to assess HLA alleles across diverse populations, for MS risk as well as for other diseases and conditions.
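    The tagging indices named in this abstract (LD R², sensitivity, specificity, allele frequency) can be illustrated with a minimal sketch. It assumes phased haplotypes coded as 0/1 for carrying the tag SNP allele and the HLA allele; the function name and the toy data are hypothetical and are not taken from the study.

```python
from typing import Dict, Sequence


def tagging_metrics(snp: Sequence[int], hla: Sequence[int]) -> Dict[str, float]:
    """Compare a tag SNP allele with an HLA allele across phased haplotypes.

    snp[i] / hla[i] are 1 if haplotype i carries the allele, otherwise 0.
    """
    if len(snp) != len(hla) or len(snp) == 0:
        raise ValueError("snp and hla must be non-empty and of equal length")
    n = len(snp)
    nan = float("nan")
    # 2x2 table: SNP allele (test) versus HLA allele (truth).
    tp = sum(1 for s, h in zip(snp, hla) if s == 1 and h == 1)
    fp = sum(1 for s, h in zip(snp, hla) if s == 1 and h == 0)
    fn = sum(1 for s, h in zip(snp, hla) if s == 0 and h == 1)
    tn = n - tp - fp - fn
    p_snp = (tp + fp) / n
    p_hla = (tp + fn) / n
    # LD coefficient D and its squared correlation r^2.
    d = tp / n - p_snp * p_hla
    denom = p_snp * (1 - p_snp) * p_hla * (1 - p_hla)
    return {
        "r2": d * d / denom if denom > 0 else nan,
        "sensitivity": tp / (tp + fn) if (tp + fn) else nan,
        "specificity": tn / (tn + fp) if (tn + fp) else nan,
        "snp_allele_freq": p_snp,
    }


# Toy haplotypes (hypothetical): the SNP allele tags every HLA carrier
# but also appears on one non-carrier haplotype.
toy = tagging_metrics([1, 1, 0, 0, 0, 0, 1, 0], [1, 1, 0, 0, 0, 0, 0, 0])
perfect = tagging_metrics([1, 0, 0, 1], [1, 0, 0, 1])
```

    In the toy data the SNP has full sensitivity but imperfect specificity, which is the kind of pattern that lowers R² even when every carrier is tagged.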






  • Article type: Journal Article
    The VISAGE Enhanced Tool for Appearance and Ancestry (ET) has been designed to combine markers for the prediction of bio-geographical ancestry plus a range of externally visible characteristics into a single massively parallel sequencing (MPS) assay. We describe the development of the ancestry panel markers used in ET, and the enhanced analyses they provide compared to previous MPS-based forensic ancestry assays. As well as established autosomal single nucleotide polymorphisms (SNPs) that differentiate sub-Saharan African, European, East Asian, South Asian, Native American, and Oceanian populations, ET includes autosomal SNPs able to efficiently differentiate populations from Middle East regions. The ability of the ET autosomal ancestry SNPs to distinguish Middle East populations from other continentally defined population groups is such that characteristic patterns for this region can be discerned in genetic cluster analysis using STRUCTURE. Joint cluster membership estimates showing individual co-ancestry that signals North African or East African origins were detected, or cluster patterns were seen that indicate origins from central and Eastern regions of the Middle East. In addition to an augmented panel of autosomal SNPs, ET includes panels of 85 Y-SNPs, 16 X-SNPs and 21 autosomal Microhaplotypes. The Y- and X-SNPs provide a distinct method for obtaining extra detail about co-ancestry patterns identified in males with admixed backgrounds. This study used the 1000 Genomes admixed African and admixed American sample sets to fully explore these enhancements to the analysis of individual co-ancestry. Samples from urban and rural Brazil with contrasting distributions of African, European, and Native American co-ancestry were also studied to gauge the efficiency of combining Y- and X-SNP data for this purpose. 
    The small panel of Microhaplotypes incorporated in ET was selected because these loci showed the highest levels of haplotype diversity amongst the seven population groups we sought to differentiate. Microhaplotype data were not formally combined with single-site SNP genotypes to analyse ancestry. However, the haplotype sequence reads obtained with ET from these loci create an effective system for de-convoluting two-contributor mixed DNA. We ran simple mixture experiments to demonstrate that, when the contributors have different ancestries and the mixture ratios are imbalanced (i.e., not 1:1 mixtures), the ET Microhaplotype panel is an informative system for inferring the ancestry of each contributor.






  • Article type: Journal Article
    In Brazil, high levels of agricultural activity are reflected in the consumption of enormous amounts of pesticides. Grain production in Brazil was estimated at 289.8 million tons in the 2022 harvest, an expansion of 14.7% compared with 2021. These advances are likely associated with a progressive increase in the occupational exposure of the population to pesticides. The Paraoxonase 1 gene (PON1) is involved in liver detoxification; the rs662 variant of this gene modifies the activity of the enzyme. The repair of pesticide-induced genetic damage depends on the protein produced by the X-Ray Repair Cross-Complementing Group 1 gene (XRCC1), whose function is impaired by the rs25487 variant. The present study describes the frequencies of rs662 and rs25487 and their haplotypes in a sample of 494 unrelated individuals from the state of Goiás, Brazil, and compares them with frequencies in other populations worldwide to assess variation in the distribution of these SNPs. The A allele of the rs25487 variant had a frequency of 26% in the Goiás population, and the modified rs662 G allele had a frequency of 42.8%. Four haplotypes were recorded for the rs25487 (G > A) and rs662 (A > G) markers, with frequencies of 11.9% for the A-G haplotype (both modified alleles), 30.8% for the G-G haplotype, 14.3% for the A-A haplotype, and 42.8% for the G-A haplotype (both wild-type alleles). We describe the distribution of important SNPs associated with pesticide exposure in Central Brazil, an area with a high level of agricultural activity.
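    The reported allele frequencies can be cross-checked against the haplotype frequencies by summing the haplotypes that carry each allele. The sketch below uses only the rounded percentages quoted in the abstract; the LD coefficient D at the end is an added illustration and is not a value reported by the study.

```python
# Haplotype frequencies as quoted in the abstract, keyed by
# (rs25487 allele, rs662 allele). Values are rounded in the source.
haplotypes = {
    ("A", "G"): 0.119,
    ("G", "G"): 0.308,
    ("A", "A"): 0.143,
    ("G", "A"): 0.428,
}


def marginal(freqs, locus, allele):
    """Sum the frequencies of all haplotypes carrying `allele` at `locus`."""
    return sum(f for hap, f in freqs.items() if hap[locus] == allele)


total = sum(haplotypes.values())          # not exactly 1 because of rounding
p_rs25487_a = marginal(haplotypes, 0, "A")  # compare with the reported 26%
p_rs662_g = marginal(haplotypes, 1, "G")    # compare with the reported 42.8%

# Illustration only (not reported in the abstract): LD coefficient
# D = f(A-G) - p(A) * p(G), computed from the rounded values above.
d = haplotypes[("A", "G")] - p_rs25487_a * p_rs662_g

print(f"sum={total:.3f} p(A)={p_rs25487_a:.3f} p(G)={p_rs662_g:.3f} D={d:.4f}")
```

    The marginal sums agree with the reported allele frequencies to within rounding, which is a quick consistency check on the haplotype table.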






  • Article type: Journal Article
    BACKGROUND: Population variant analysis is of great importance for gathering insights into the links between human genotype and phenotype. The 1000 Genomes Project established a valuable reference for human genetic variation; however, the integrative use of the corresponding data with other datasets within existing repositories and pipelines is not fully supported. Particularly, there is a pressing need for flexible and fast selection of population partitions based on their variant and metadata-related characteristics.
    RESULTS: Here, we target general germline or somatic mutation data sources for their seamless inclusion within an interoperable-format repository, supporting integration among them and with other genomic data, as well as their integrated use within bioinformatic workflows. In addition, we provide VarSum, a data summarization service working on sub-populations of interest selected using filters on population metadata and/or variant characteristics. The service is developed as an optimized computational framework with an Application Programming Interface (API) that can be called from within any existing computing pipeline or programming script. Provided example use cases of biological interest show the relevance, power and ease of use of the API functionalities.
    CONCLUSIONS: The proposed data integration pipeline and data set extraction and summarization API pave the way for solid computational infrastructures that quickly process cumbersome variation data, and allow biologists and bioinformaticians to easily perform scalable analysis on user-defined partitions of large cohorts from increasingly available genetic variation studies. With the current tendency to large (cross)nation-wide sequencing and variation initiatives, we expect an ever growing need for the kind of computational support hereby proposed.






  • Article type: Journal Article
    To compile a new South Asian-informative panel of forensic ancestry SNPs, we changed the strategy for selecting the most powerful markers by targeting polymorphisms with near absolute specificity, in which the South Asian-informative allele is absent from all other populations or present at frequencies below 0.001 (one in a thousand). More than 120 candidate SNPs were identified from 1000 Genomes datasets that satisfied an allele frequency screen of ≥ 0.1 (10 % or more) in South Asians and ≤ 0.001 (0.1 % or less) in African, East Asian, and European populations. From the candidate pool of markers, a final panel of 36 SNPs, widely distributed across most autosomes, was selected, with allele frequencies in the five 1000 Genomes South Asian populations ranging from 0.4 to 0.15. Slightly lower average allele frequencies, but consistent patterns of informativeness, were observed in the gnomAD South Asian datasets used to validate the 1000 Genomes variant annotations. We named the panel of 36 South Asian-specific SNPs Eurasiaplex-2 and evaluated its informativeness by compiling worldwide population data from 4097 samples in four genome variation databases that largely complement the global sampling of 1000 Genomes. Consistent patterns of allele frequency distribution, which were specific to South Asia, were observed in all populations in, or close to, the Indian sub-continent. Pakistani populations from the HGDP-CEPH panel had markedly lower allele frequencies, highlighting the need to develop a statistical system to evaluate the ancestry inference value of counting the number of population-specific alleles present in an individual.
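    The allele frequency screen described above (≥ 0.1 in South Asians, ≤ 0.001 in African, East Asian, and European populations) amounts to a simple filter. The following is a minimal sketch, assuming per-group allele frequencies are already available; the record type, field names, and SNP identifiers are hypothetical and do not come from the study.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SnpFreqs:
    """Allele frequencies of one candidate allele in four population groups."""
    snp_id: str  # hypothetical identifier
    sas: float   # South Asian
    afr: float   # African
    eas: float   # East Asian
    eur: float   # European


def passes_screen(s: SnpFreqs, min_sas: float = 0.1, max_other: float = 0.001) -> bool:
    """True if the allele is common in South Asians and near-absent elsewhere."""
    return s.sas >= min_sas and max(s.afr, s.eas, s.eur) <= max_other


def screen(candidates: List[SnpFreqs]) -> List[str]:
    """Return the identifiers of the candidates that pass the screen."""
    return [s.snp_id for s in candidates if passes_screen(s)]


# Made-up frequencies for illustration only.
example = [
    SnpFreqs("snp_a", sas=0.22, afr=0.0, eas=0.0005, eur=0.001),  # passes
    SnpFreqs("snp_b", sas=0.30, afr=0.0, eas=0.002, eur=0.0),     # too common in EAS
    SnpFreqs("snp_c", sas=0.05, afr=0.0, eas=0.0, eur=0.0),       # too rare in SAS
]
selected = screen(example)
```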






  • Article type: Journal Article
    CONCLUSIONS: We developed PyLAE, a new tool for determining local ancestry along a genome using whole-genome sequencing data or high-density genotyping experiments. PyLAE can process an arbitrarily large number of ancestral populations (with or without an informative prior). Since PyLAE does not involve estimating many parameters, it can process thousands of genomes within a day. PyLAE can run on phased or unphased genomic data. We have shown how PyLAE can be applied to the identification of differentially enriched pathways between populations. The local ancestry approach results in higher enrichment scores compared to whole-genome approaches. We benchmarked PyLAE using the 1000 Genomes dataset, comparing the aggregated predictions with the global admixture results and the current gold standard program RFMix. Computational efficiency, minimal requirements for data pre-processing, straightforward presentation of results, and ease of installation make PyLAE a valuable tool to study admixed populations.
    METHODS: The source code and installation manual are available at https://github.com/smetam/pylae.






  • Article type: Journal Article
    Microhaplotypes (MHs) are molecular markers defined by two or more closely linked single nucleotide polymorphisms within a short segment of DNA. As emerging forensic genetic markers, MHs are free of stutter artefacts, are highly polymorphic, and permit the design of smaller amplicons. To characterise these markers from a genome-wide perspective and further explore their potential applications, we constructed the most comprehensive MH dataset to date, based on whole-genome sequencing data of 105 Southern Han Chinese individuals from the 1000 Genomes Project. The results showed 9,490,075 MH loci within a span of 350 bp in the human genome, and the distribution density of microhaplotypes was indicative of the level of chromosomal variation. Polymorphism analysis of MHs across various base spans showed that their polymorphism can reach or exceed that of commonly used short tandem repeat loci. In addition, we summarise features of MHs such as their flexible assembly and propose a scheme for building a public microhaplotype database.
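    One way to picture how candidate microhaplotype loci could be enumerated is to list SNP pairs whose positions lie within a maximum span. The sketch below is an assumption about such an enumeration, not the procedure used in the study; the function name and the positions are hypothetical, and only the 350 bp span comes from the abstract.

```python
from typing import List, Tuple


def snp_pairs_within_span(positions: List[int], max_span: int = 350) -> List[Tuple[int, int]]:
    """List all pairs of SNP positions on one chromosome at most max_span bp apart."""
    pos = sorted(set(positions))
    pairs: List[Tuple[int, int]] = []
    for i, start in enumerate(pos):
        for end in pos[i + 1:]:
            if end - start > max_span:
                break  # positions are sorted, so later ones are even farther away
            pairs.append((start, end))
    return pairs


# Hypothetical SNP coordinates on a single chromosome.
pairs = snp_pairs_within_span([100, 200, 500, 1000])
```

    Loci with three or more SNPs could be built by extending each pair with further SNPs inside the same window, which is where the "flexible assembly" of microhaplotypes becomes relevant.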






  • Article type: Journal Article
    We detail the development of the ancestry informative single nucleotide polymorphisms (SNPs) panel forming part of the VISAGE Basic Tool (BT), which combines 41 appearance predictive SNPs and 112 ancestry predictive SNPs (three SNPs shared between sets) in one massively parallel sequencing (MPS) multiplex, whereas blood-based age analysis using methylation markers is run in a parallel MPS analysis pipeline. The selection of SNPs for the BT ancestry panel focused on established forensic markers that already have a proven track record of good sequencing performance in MPS, and the overall SNP multiplex scale closely matched that of existing forensic MPS assays. SNPs were chosen to differentiate individuals from the five main continental population groups of Africa, Europe, East Asia, America, and Oceania, extended to include differentiation of individuals from South Asia. From analysis of 1000 Genomes and HGDP-CEPH samples from these six population groups, the BT ancestry panel was shown to have no classification error using the Bayes likelihood calculators of the Snipper online analysis portal. The differentiation power of the component ancestry SNPs of BT was balanced as far as possible to avoid bias in the estimation of co-ancestry proportions in individuals with admixed backgrounds. The balancing process led to very similar cumulative population-specific divergence values for Africa, Europe, America, and Oceania, with East Asia being slightly below average, and South Asia an outlier from the other groups. Comparisons were made of the African, European, and Native American estimated co-ancestry proportions in the six admixed 1000 Genomes populations, using the BT ancestry panel SNPs and 572,000 Affymetrix Human Origins array SNPs. Very similar co-ancestry proportions were observed down to a minimum value of 10%, below which, low-level co-ancestry was not always reliably detected by BT SNPs. 
    The Snipper analysis portal provides a comprehensive population dataset for the BT ancestry panel SNPs, comprising a 520-sample standardised reference dataset; 3445 additional samples from 1000 Genomes, HGDP-CEPH, Simons Foundation and Estonian Biocentre genome diversity projects; and 167 samples of six populations from in-house genotyping of individuals from Middle East, North and East African regions complementing those of the sampling regimes of the other diversity projects.
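    A Bayes likelihood classifier of the general kind mentioned above can be sketched as follows. This is a generic naive Bayes illustration, assuming independent SNPs, Hardy-Weinberg genotype proportions, and equal priors; it is not a description of how Snipper is implemented, and the population labels and frequencies are made up.

```python
import math
from typing import Dict, List, Tuple


def genotype_log_likelihood(genotype: int, freq: float) -> float:
    """Log-probability of a genotype (0, 1 or 2 copies of an allele) given the
    population frequency of that allele, under Hardy-Weinberg proportions."""
    f = min(max(freq, 1e-6), 1 - 1e-6)  # keep away from 0 and 1 to avoid log(0)
    probs = ((1 - f) ** 2, 2 * f * (1 - f), f ** 2)
    return math.log(probs[genotype])


def classify(genotypes: List[int], pop_freqs: Dict[str, List[float]]) -> Tuple[str, Dict[str, float]]:
    """Return the population with the highest summed log-likelihood, plus all scores."""
    scores = {
        pop: sum(genotype_log_likelihood(g, f) for g, f in zip(genotypes, freqs))
        for pop, freqs in pop_freqs.items()
    }
    best = max(scores, key=lambda pop: scores[pop])
    return best, scores


# Hypothetical allele frequencies for three SNPs in two reference populations.
reference = {"pop_x": [0.9, 0.8, 0.1], "pop_y": [0.1, 0.2, 0.9]}
best, scores = classify([2, 2, 0], reference)
```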






  • Article type: Journal Article
    The aim of this study was to analyze the worldwide distribution of SNP rs4870723 in the COL14A1 gene, to check whether there are significant genetic differences among populations, and to test whether the variant is under selection.
    Genomic DNA was extracted from 69 unrelated individuals from Sardinia and genotyped for SNP rs4870723. Data were compared with 26 different populations, clustered in 5 super-populations, from the public 1000 Genomes database. Allele frequency and heterozygosity were calculated with Genepop. Hardy-Weinberg equilibrium and pairwise population differentiation through analysis of molecular variance (AMOVA FST) were determined with Arlequin.
    Allele frequencies of COL14A1 rs4870723 were compared in 27 populations clustered in 5 super-populations. All populations were in Hardy-Weinberg equilibrium. In almost all populations, allele C was the most frequent allele, reaching the highest values in East Asia. The 27 populations showed an appreciable structure, with significant differences observed between European, African, and Asian populations.
    Significant differences were observed in the rs4870723 SNP distribution among the populations studied. However, we found no evidence of selective pressure. Rather, the differentiation among the populations is likely the result of founder effects, genetic drift, and cultural factors, all of which are known to establish and maintain genetic diversity between populations.
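    The quantities named in this abstract (allele frequency, heterozygosity, Hardy-Weinberg equilibrium) can be illustrated for a single biallelic SNP. The sketch below uses a chi-square goodness-of-fit test, which is a simpler stand-in and not necessarily the procedure applied by Genepop or Arlequin in the study; the genotype counts are hypothetical.

```python
from typing import Dict


def hwe_summary(n_aa: int, n_ab: int, n_bb: int) -> Dict[str, float]:
    """Allele frequency, heterozygosity and HWE chi-square for one biallelic SNP."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("at least one genotype is required")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    # Goodness-of-fit statistic with 1 degree of freedom.
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    return {
        "freq_a": p,
        "observed_het": n_ab / n,
        "expected_het": 2 * p * q,
        "chi2": chi2,
    }


CHI2_CRITICAL_1DF_5PCT = 3.841  # 5% critical value of chi-square with 1 df

balanced = hwe_summary(25, 50, 25)  # hypothetical counts in exact HWE proportions
skewed = hwe_summary(40, 10, 19)    # hypothetical counts with a heterozygote deficit
```

    A chi-square value above the critical threshold suggests departure from equilibrium; with small samples an exact test is usually preferred.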






