Keywords: SNP genotyping; SOM; imputation; machine learning; missing data

Source: DOI:10.1111/1755-0998.13992

Abstract:
Current methodologies for genome-wide single-nucleotide polymorphism (SNP) genotyping produce large amounts of missing data that may affect statistical inference and bias the outcome of experiments. Genotype imputation is routinely used in well-studied species to buffer the impact of missing data on downstream analyses, and several algorithms are available to fill in missing genotypes. However, these algorithms rely on reference haplotype panels, and the lack of such panels precludes their use in genomic studies of non-model organisms. As an alternative, machine learning algorithms are employed to explore the genotype data and to estimate the missing genotypes. Here, we propose an imputation method based on self-organizing maps (SOM), a widely used type of neural network formed by spatially distributed neurons that cluster similar inputs into neighbouring neurons. The method explores genotype datasets to select SNP loci, builds binary vectors from their genotypes, and initializes and trains a neural network for each query missing SNP genotype. The SOM-derived clustering is then used to impute the best genotype. To automate the imputation process, we have implemented gtImputation, an open-source application programmed in Python3 with a user-friendly GUI that guides the user through the whole process. The method's performance was validated by comparing its accuracy, precision and sensitivity with those of other available imputation algorithms on several benchmark genotype datasets. Our approach produced highly accurate and precise genotype imputations, even for SNPs with low-frequency alleles, and outperformed other algorithms, especially for datasets from mixed populations with unrelated individuals.
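The general idea of clustering binary genotype vectors with a SOM and reading a missing genotype off the resulting map can be illustrated with a small sketch. This is not the gtImputation code and does not reproduce the authors' locus selection, encoding or network settings. It assumes genotypes coded as 0, 1 or 2 copies of the alternate allele, a one-hot binary encoding per locus, a fixed 3x3 grid, and a majority vote within the best-matching neuron. All function names and parameters (`encode`, `train_som`, `impute`, `rows`, `cols`, `epochs`, `lr0`) are hypothetical.

```python
import math
import random
from collections import Counter


def encode(genotypes):
    """One-hot encode genotypes (0, 1, 2) as a binary vector (assumed encoding)."""
    vec = []
    for g in genotypes:
        vec.extend(1.0 if g == k else 0.0 for k in (0, 1, 2))
    return vec


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_matching_unit(weights, vec):
    """Grid coordinate of the neuron whose weight vector is closest to vec."""
    return min(weights, key=lambda pos: sq_dist(weights[pos], vec))


def train_som(vectors, rows=3, cols=3, epochs=50, lr0=0.5, seed=0):
    """Train a small rectangular SOM with a Gaussian neighbourhood."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    sigma0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                  # learning rate decays linearly
        sigma = max(sigma0 * (1.0 - frac), 0.5)  # neighbourhood radius shrinks
        for vec in rng.sample(vectors, len(vectors)):  # shuffled pass over the data
            bmu = best_matching_unit(weights, vec)
            for pos, w in weights.items():
                grid_d2 = (pos[0] - bmu[0]) ** 2 + (pos[1] - bmu[1]) ** 2
                h = lr * math.exp(-grid_d2 / (2.0 * sigma ** 2))
                for i in range(dim):
                    w[i] += h * (vec[i] - w[i])
    return weights


def impute(train_flanking, train_target, query_flanking, **som_kwargs):
    """Impute one missing genotype from the genotypes at other (flanking) loci.

    train_flanking: genotypes at the selected loci for samples whose target
                    genotype is known; train_target: those known genotypes.
    """
    vectors = [encode(g) for g in train_flanking]
    weights = train_som(vectors, **som_kwargs)
    # Label each neuron with the target genotypes of the samples it wins.
    labels = {}
    for vec, target in zip(vectors, train_target):
        labels.setdefault(best_matching_unit(weights, vec), Counter())[target] += 1
    query = encode(query_flanking)
    # Use the closest neuron that won at least one training sample.
    nearest = min(labels, key=lambda pos: sq_dist(weights[pos], query))
    return labels[nearest].most_common(1)[0][0]


# Toy usage: the target genotype tracks the genotypes at four other loci.
flanking = [[0, 0, 0, 0]] * 4 + [[2, 2, 2, 2]] * 4
target = [0] * 4 + [2] * 4
print(impute(flanking, target, [0, 0, 0, 0]))
```

A new map is trained for each query genotype here, which mirrors the abstract's description of one network per missing genotype, but the grid size, decay schedule and voting rule are placeholders chosen for brevity.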