Keywords: L0Learn; ensemble learning; penalized regression; polygenic risk scores

MeSH: Humans; Multifactorial Inheritance / genetics; Genome-Wide Association Study / methods; Machine Learning; Genetic Predisposition to Disease; Polymorphism, Single Nucleotide

Source: DOI:10.1073/pnas.2403210121

Abstract:
Polygenic risk scores (PRS) enhance population risk stratification and advance personalized medicine, but existing methods are limited by computational burden, predictive accuracy, and adaptability to a wide range of genetic architectures. To address these issues, we propose Aggregated L0Learn using Summary-level data (ALL-Sum), a fast and scalable ensemble learning method for computing PRS using summary statistics from genome-wide association studies (GWAS). ALL-Sum leverages L0L2 penalized regression and ensemble learning across tuning parameters to flexibly model traits with diverse genetic architectures. In extensive large-scale simulations across a wide range of polygenicity and GWAS sample sizes, ALL-Sum consistently outperformed popular alternative methods in prediction accuracy, runtime, and memory usage, by 10%, 20-fold, and threefold, respectively, and demonstrated robustness to diverse genetic architectures. We validated the performance of ALL-Sum in real data analysis of 11 complex traits using GWAS summary statistics from nine data sources, including the Global Lipids Genetics Consortium, Breast Cancer Association Consortium, and FinnGen Biobank, with validation in the UK Biobank. Our results show that, on average, ALL-Sum obtained PRS with 25% higher accuracy, 15 times faster computation, and half the memory usage of current state-of-the-art methods, and had robust performance across a wide range of traits and diseases. Furthermore, our method demonstrates stable prediction when using linkage disequilibrium computed from different data sources. ALL-Sum is available as a user-friendly R software package with publicly available reference data for streamlined analysis.
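As a rough illustration of the two ingredients named in the abstract, L0L2 penalized regression on summary statistics and an ensemble across tuning parameters, the following Python sketch may help. It is not the ALL-Sum implementation and does not use the L0Learn package: the function names (`l0l2_coordinate_descent`, `ensemble_weights`), the objective scaling, the cyclic coordinate-descent solver, and the least-squares stacking step on a separate tuning set are all assumptions made for illustration. Here `r` stands for marginal SNP-trait correlations from GWAS and `R` for a linkage disequilibrium (LD) correlation matrix.

```python
import numpy as np

def l0l2_coordinate_descent(r, R, lam0, lam2, n_iter=100, tol=1e-8):
    """Hypothetical solver (not the authors' code) for the assumed objective
        0.5 * b'Rb - b'r + lam0 * ||b||_0 + lam2 * ||b||_2^2
    using cyclic coordinate descent on summary statistics only."""
    r = np.asarray(r, dtype=float)
    R = np.asarray(R, dtype=float)
    beta = np.zeros(len(r))
    denom = np.diag(R) + 2.0 * lam2  # curvature of each one-dimensional subproblem
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(len(r)):
            # Correlation of SNP j with the residual, excluding its own effect.
            c = r[j] - R[j] @ beta + R[j, j] * beta[j]
            # A nonzero coefficient lowers the smooth part by c^2 / (2 * denom);
            # keep it only if that gain exceeds the L0 penalty lam0.
            new = c / denom[j] if c * c > 2.0 * lam0 * denom[j] else 0.0
            max_change = max(max_change, abs(new - beta[j]))
            beta[j] = new
        if max_change < tol:
            break
    return beta

def fit_grid(r, R, lam0_grid, lam2_grid):
    """Fit one coefficient vector per (lam0, lam2) pair."""
    return [l0l2_coordinate_descent(r, R, l0, l2)
            for l0 in lam0_grid for l2 in lam2_grid]

def ensemble_weights(G_tune, y_tune, betas):
    """Assumed stacking step: regress the tuning-set phenotype on the
    candidate scores (plus an intercept) by ordinary least squares."""
    scores = G_tune @ np.column_stack(betas)
    design = np.column_stack([np.ones(len(y_tune)), scores])
    w, *_ = np.linalg.lstsq(design, y_tune, rcond=None)
    return w  # w[0] is the intercept, w[1:] weight the candidate scores
```

Under these assumptions, the final score for a new genotype matrix `G` would be `w[0] + (G @ np.column_stack(betas)) @ w[1:]`. Combining the grid this way, rather than selecting a single tuning-parameter pair, is one way to let sparse and dense candidate models both contribute.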