Keywords: Biomarkers; Feature selection; Machine learning; Medical genetics

MeSH: Humans; Coronary Artery Disease / genetics; Genetic Predisposition to Disease; Genome-Wide Association Study / methods; Risk Factors; Genetic Risk Score; Machine Learning; Genomics

Source: DOI:10.1186/s12967-024-05090-1

Abstract:
Machine learning (ML) methods are becoming increasingly important in genome-wide association studies for identifying key genetic variants, or SNPs, that statistical methods might overlook. Statistical methods predominantly identify SNPs with notable effect sizes by running association tests on individual genetic variants, one at a time, to determine their relationship with the target phenotype. These genetic variants are then used to create polygenic risk scores (PRSs), which estimate an individual's genetic risk for complex diseases such as cancer or cardiovascular disorders. Unlike traditional methods, ML algorithms can identify groups of low-risk genetic variants that improve prediction accuracy when combined in a mathematical model. However, applying ML strategies requires addressing the feature selection challenge to prevent overfitting. Moreover, ensuring that the ML model depends on a concise set of genomic variants enhances its applicability in clinical settings, where testing is feasible for only a limited number of SNPs. In this study, we introduce a robust pipeline that applies ML algorithms in combination with feature selection (ML-FS algorithms), aimed at identifying the most significant genomic variants associated with the coronary artery disease (CAD) phenotype. The proposed computational approach was tested on individuals from the UK Biobank, differentiating between CAD and non-CAD individuals within this extensive cohort, and benchmarked against standard PRS-based methodologies such as LDpred2 and Lassosum. Our strategy incorporates cross-validation to ensure a more robust evaluation of genomic variant-based prediction models. Cross-validation is commonly applied in machine learning strategies but has often been neglected in previous studies assessing the predictive performance of polygenic risk scores.
Our results demonstrate that the ML-FS algorithm can identify panels with as few as 50 genetic markers that achieve approximately 80% accuracy when used in combination with known risk factors. The modest increase in accuracy over PRS performance is noteworthy, especially considering that PRS models incorporate a substantially larger number of genetic variants, which can pose practical challenges in clinical settings. Additionally, the proposed approach revealed novel CAD-genetic variant associations.
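
The general idea of combining feature selection with a classifier and evaluating it by cross-validation can be illustrated with a small sketch. This is not the authors' pipeline: the data are synthetic genotypes (not UK Biobank), the selector (`SelectKBest` with an ANOVA F-test) and the classifier (logistic regression) are stand-ins chosen for brevity, and all sizes and effect values are assumptions. The point it shows is that the selection step sits inside the cross-validated pipeline, so the 50-SNP panel is re-chosen on each training fold and never sees the held-out individuals.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)

# Synthetic stand-in for genotype data (assumption, not real cohort data):
# 600 individuals x 2000 SNPs, coded as minor-allele counts 0, 1 or 2.
n_samples, n_snps, n_causal = 600, 2000, 20
X = rng.binomial(2, 0.3, size=(n_samples, n_snps))

# Hypothetical phenotype: a few SNPs with small additive effects plus noise,
# thresholded at the median to give balanced case/control labels.
effects = np.zeros(n_snps)
effects[:n_causal] = 0.4
liability = X @ effects + rng.normal(0.0, 1.0, n_samples)
y = (liability > np.median(liability)).astype(int)

# Feature selection and classifier in one pipeline, so that selection is
# refit on every training fold (no leakage from the held-out fold).
pipeline = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=50)),  # keep a 50-SNP panel
    ("clf", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="accuracy")

print("Per-fold accuracy:", np.round(scores, 3))
print("Mean accuracy:", round(float(scores.mean()), 3))
```

Selecting the SNP panel once on the full dataset and cross-validating only the classifier would let information from the test folds influence the panel and inflate the accuracy estimate, which is why the selector is placed inside the pipeline here.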