Keywords: HPRC; Human pangenome; Minimap2; Minimap2 index modification; NGS; Read alignment

MeSH: Humans; Whole Genome Sequencing / methods; Genome, Human; Genetic Variation / genetics; High-Throughput Nucleotide Sequencing / methods; Polymorphism, Single Nucleotide / genetics; Sequence Alignment / methods; Software; Algorithms; Genome-Wide Association Study / methods

Source: DOI:10.1186/s12859-024-05862-y

Abstract:
BACKGROUND: Aligning reads to a reference genome sequence is a key step in the analysis of human whole-genome sequencing data obtained with next-generation sequencing (NGS) technologies. The quality of downstream analyses, such as the clinical interpretation of genetic variants or a genome-wide association study, depends on alignment placing each read at its correct position. The volume of human NGS whole-genome sequencing data is growing steadily, and sequencing projects worldwide have produced large-scale databases of genetic variants found in sequenced human genomes. This information about known variants can be used to improve alignment quality when analysing sequencing data from a new individual, for example by constructing a genome graph. Existing methods that align reads to a linear reference genome are fast, whereas methods that align reads to a genome graph are more accurate in variable regions of the genome. A read alignment method that incorporates known genetic variants into the index of the linear reference sequence can combine the advantages of both approaches.
RESULTS: We present minimap2_index_modifier, a tool that constructs a modified index of a reference genome using known single nucleotide variants and insertions/deletions (indels) specific to a given human population. Using the modified minimap2 index improves variant calling quality without changes to the bioinformatics pipeline and without significant additional computational overhead. On the PrecisionFDA Truth Challenge V2 benchmark data (HG002 short reads aligned to the GRCh38 linear reference (GCA_000001405.15) with parameters k = 27 and w = 14), modifying the index with genetic variants from the Human Pangenome Reference Consortium (HPRC) reduced the number of false-negative variants by more than 9500 and the number of false positives by more than 7000.
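The idea of adding known variants to a linear minimizer index can be illustrated with a small Python sketch. This is not the implementation of minimap2_index_modifier: the function names (`minimizers`, `build_index`, `add_variant`) are hypothetical, the index is a plain dictionary, and minimizers are chosen by lexicographic order, whereas minimap2 itself orders k-mers by a hash of their canonical form. The sketch only shows the general principle, assuming a variant is given as a 0-based position with reference and alternate alleles: k-mers of the alternate haplotype that overlap the variant are added to the index and anchored to reference coordinates.

```python
from collections import defaultdict


def minimizers(seq, k, w):
    """Return {(kmer, start)}: the smallest k-mer (leftmost on ties)
    in every window of w consecutive k-mers. Lexicographic order is a
    simplification; real aligners use a hash of the canonical k-mer."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        best = min(range(start, start + w), key=lambda j: (kmers[j], j))
        picked.add((kmers[best], best))
    return picked


def build_index(ref, k, w):
    """Map each reference minimizer k-mer to its set of start positions."""
    index = defaultdict(set)
    for kmer, pos in minimizers(ref, k, w):
        index[kmer].add(pos)
    return index


def add_variant(index, ref, pos, ref_allele, alt_allele, k, w):
    """Add minimizers of the alternate haplotype that overlap the variant.
    Returns the number of new (kmer, position) entries."""
    assert ref[pos:pos + len(ref_allele)] == ref_allele, "REF allele mismatch"
    flank = k + w - 2  # enough context for every window touching the variant
    lo = max(0, pos - flank)
    hi = min(len(ref), pos + len(ref_allele) + flank)
    local = ref[lo:pos] + alt_allele + ref[pos + len(ref_allele):hi]
    v_start = pos - lo
    v_end = v_start + len(alt_allele)
    added = 0
    for kmer, p in minimizers(local, k, w):
        if p < v_end and p + k > v_start:  # k-mer overlaps the alt allele
            # Anchor to reference coordinates; k-mers that start inside
            # an insertion are anchored at the variant position.
            ref_pos = lo + min(p, v_start)
            if ref_pos not in index[kmer]:
                index[kmer].add(ref_pos)
                added += 1
    return added


if __name__ == "__main__":
    ref = "ACGTTGCAAGCTTAGGCATCGATCGGATTACA"
    index = build_index(ref, k=5, w=3)
    n_new = add_variant(index, ref, pos=15, ref_allele="G", alt_allele="T", k=5, w=3)
    print(f"added {n_new} variant-aware minimizer entries")
```

With the entries anchored to reference coordinates, a read carrying the alternate allele can seed at the same locus as a read carrying the reference allele, which is why the downstream pipeline does not need to change. The toy values k = 5 and w = 3 are for illustration only; the abstract reports results for k = 27 and w = 14.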