Keywords: Artificial intelligence; Deep learning; Genomics; Machine learning; Phylogenomics; Population genetics

Source: DOI:10.1016/j.ympev.2024.108142

Abstract:
Assigning a query individual animal or plant to its derived population is a prime task in diverse applications related to organismal genealogy. Such endeavors have conventionally relied on short DNA sequences under a phylogenetic framework. These methods are naturally constrained when the inferred source populations have ambiguous phylogenetic structure, a scenario that demands substantially more informative genetic signals. Recent advances in cost-effective production of whole-genome sequences and in artificial intelligence have created an unprecedented opportunity to trace the population origin of essentially any given individual, as long as the genome reference data are comprehensive and standardized. Here, we developed a convolutional neural network method to identify population origins using genomic SNPs. Three empirical datasets (an Asian honeybee, a red fire ant, and a chicken dataset) and two simulated populations were used as a proof of concept. The performance tests indicate that our method can accurately identify the genealogical origin of query individuals, with success rates ranging from >93% to 100%. We further showed that the accuracy of the model can be significantly increased by refining the informative sites through FST filtering. Our method is robust to batch size and number of epochs, whereas model learning benefits from a properly preset learning rate. Moreover, we explained the importance scores of key sites to support algorithm interpretability and credibility, an aspect that has been largely ignored. We anticipate that by coupling genomics and deep learning, our method will see broad potential in conservation and management applications that involve natural resources, invasive pests and weeds, and illegal trade in wildlife products.
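The FST filtering step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which FST estimator or threshold the authors used, so everything here is an assumption: the code uses a simple Nei-style per-site estimate, (H_T - H_S) / H_T, with equal population weights, and the function names `per_site_fst` and `top_fst_sites` are hypothetical. Genotypes are assumed to be coded as alternate-allele counts (0, 1, or 2) for diploid individuals.

```python
from collections import defaultdict


def per_site_fst(genotypes, labels):
    """Nei-style F_ST per site (an assumed estimator, not the paper's).

    genotypes[i][j] is the 0/1/2 alternate-allele count of individual i
    at site j; labels[i] is that individual's population label.
    """
    by_pop = defaultdict(list)
    for row, pop in zip(genotypes, labels):
        by_pop[pop].append(row)

    n_sites = len(genotypes[0])
    fst = []
    for j in range(n_sites):
        # Alternate-allele frequency in each population at site j.
        freqs = [sum(r[j] for r in rows) / (2 * len(rows))
                 for rows in by_pop.values()]
        # Mean expected heterozygosity within populations.
        h_s = sum(2 * p * (1 - p) for p in freqs) / len(freqs)
        # Expected heterozygosity of the pooled (equal-weight) frequency.
        p_bar = sum(freqs) / len(freqs)
        h_t = 2 * p_bar * (1 - p_bar)
        # Monomorphic sites carry no signal; report 0 instead of dividing by 0.
        fst.append((h_t - h_s) / h_t if h_t > 0 else 0.0)
    return fst


def top_fst_sites(genotypes, labels, k):
    """Indices of the k sites with the highest F_ST (ties keep site order)."""
    fst = per_site_fst(genotypes, labels)
    return sorted(range(len(fst)), key=lambda j: fst[j], reverse=True)[:k]


if __name__ == "__main__":
    # Toy data: site 0 is fixed for different alleles in A and B,
    # site 1 has the same frequency in both, site 2 is monomorphic.
    toy = [[0, 0, 0], [0, 2, 0], [2, 2, 0], [2, 0, 0]]
    pops = ["A", "A", "B", "B"]
    print(per_site_fst(toy, pops))
    print(top_fst_sites(toy, pops, 1))
```

In a pipeline of this kind, the retained site indices would select the columns of the SNP matrix that are passed to the classifier. The choice of `k` is a tuning decision that the abstract does not specify.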