phenotype prediction

  • Article type: Journal Article
    Methods for estimating polygenic scores (PGSs) from genome-wide association studies are increasingly utilized. However, independent method evaluation is lacking, and method comparisons are often limited. Here, we evaluate polygenic scores derived via seven methods in five biobank studies (totaling about 1.2 million participants) across 16 diseases and quantitative traits, building on a reference-standardized framework. We conducted meta-analyses to quantify the effects of method choice, hyperparameter tuning, method ensembling, and the target biobank on PGS performance. We found that no single method consistently outperformed all others. When methods were well tuned, PGS effect sizes were more variable between biobanks than between methods within biobanks. Differences between methods were largest for the two investigated autoimmune diseases, seropositive rheumatoid arthritis and type 1 diabetes. For most methods, cross-validation was more reliable for tuning hyperparameters than automatic tuning (without the use of target data). For a given target phenotype, elastic net models combining PGSs across methods (ensemble PGS) tuned in the UK Biobank provided consistent, high, and cross-biobank transferable performance, increasing PGS effect sizes (β coefficients) by a median of 5.0% relative to LDpred2 and MegaPRS (the two best-performing single methods when tuned with cross-validation). Our interactively browsable online results and open-source workflow prspipe provide a rich resource and reference for the analysis of polygenic scoring methods across biobanks.
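
The ensemble idea in this abstract (an elastic net that combines PGSs from several methods) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from prspipe: the function names, the toy data, the absence of an intercept (scores and phenotype are assumed to be centered), and the objective parameterization `(1/2n)·||y − Xw||² + alpha·l1_ratio·||w||₁ + 0.5·alpha·(1 − l1_ratio)·||w||²`, which is solved here by plain coordinate descent.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the L1 proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net_ensemble(scores, phenotype, alpha, l1_ratio, n_iter=200):
    """Fit ensemble weights by coordinate descent (hypothetical helper).

    scores[i][j] is the PGS of individual i from method j; phenotype[i] is
    the centered target. Assumes no all-zero column when alpha == 0.
    """
    n = len(scores)
    k = len(scores[0])
    weights = [0.0] * k
    for _ in range(n_iter):
        for j in range(k):
            rho = 0.0  # covariance of column j with the partial residual
            z = 0.0    # mean squared value of column j
            for row, y in zip(scores, phenotype):
                pred_without_j = sum(weights[m] * row[m] for m in range(k) if m != j)
                rho += row[j] * (y - pred_without_j)
                z += row[j] ** 2
            rho /= n
            z /= n
            weights[j] = soft_threshold(rho, alpha * l1_ratio) / (z + alpha * (1.0 - l1_ratio))
    return weights


# Toy data (invented): two PGS methods, phenotype = 0.5 * pgs_a + 1.5 * pgs_b.
toy_scores = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
toy_phenotype = [0.5, 1.5, 2.0, 2.5]

unpenalized = elastic_net_ensemble(toy_scores, toy_phenotype, alpha=0.0, l1_ratio=0.5)
heavily_penalized = elastic_net_ensemble(toy_scores, toy_phenotype, alpha=10.0, l1_ratio=0.5)
print(unpenalized, heavily_penalized)
```

In practice `alpha` and `l1_ratio` would be chosen by cross-validation in a tuning cohort, which is the role the abstract assigns to the UK Biobank.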

  • Article type: Journal Article
    Genotype-to-phenotype mapping is an essential problem in the current genomic era. While qualitative case-control predictions have received significant attention, less emphasis has been placed on predicting quantitative phenotypes. This emerging field holds great promise in revealing intricate connections between microbial communities and host health. However, heterogeneity in microbiome datasets poses a substantial challenge to the accuracy of predictions and undermines the reproducibility of models. To tackle this challenge, we investigated 22 normalization methods aimed at removing heterogeneity across multiple datasets, conducted a comprehensive review of them, and evaluated their effectiveness in predicting quantitative phenotypes in three simulation scenarios and 31 real datasets. The results indicate that none of these methods demonstrates significant superiority in predicting quantitative phenotypes or attains a noteworthy reduction in the root mean squared error (RMSE) of the predictions. Given the frequent occurrence of batch effects and the satisfactory performance of batch correction methods in predicting datasets affected by these effects, we strongly recommend utilizing batch correction methods as the initial step in predicting quantitative phenotypes. In summary, the performance of normalization methods in predicting metagenomic data remains a dynamic and ongoing research area. Our study contributes to this field by undertaking a comprehensive evaluation of diverse methods and offering valuable insights into their effectiveness in predicting quantitative phenotypes.

  • Article type: Journal Article
    The advent of rapid whole-genome sequencing has created new opportunities for computational prediction of antimicrobial resistance (AMR) phenotypes from genomic data. Both rule-based and machine learning (ML) approaches have been explored for this task, but systematic benchmarking is still needed. Here, we evaluated four state-of-the-art ML methods (Kover, PhenotypeSeeker, Seq2Geno2Pheno and Aytan-Aktug), an ML baseline and the rule-based ResFinder by training and testing each of them across 78 species-antibiotic datasets, using a rigorous benchmarking workflow that integrates three evaluation approaches, each paired with three distinct sample splitting methods. Our analysis revealed considerable variation in performance across techniques and datasets. Whereas ML methods generally excelled for closely related strains, ResFinder excelled at handling divergent genomes. Overall, Kover most frequently ranked top among the ML approaches, followed by PhenotypeSeeker and Seq2Geno2Pheno. AMR phenotypes for antibiotic classes such as macrolides and sulfonamides were predicted with the highest accuracies. The quality of predictions varied substantially across species-antibiotic combinations, particularly for beta-lactams; across species, resistance phenotype prediction for the beta-lactams aztreonam, amoxicillin/clavulanic acid, cefoxitin, ceftazidime and piperacillin/tazobactam, as well as for tetracyclines, showed more variable performance than for the other benchmarked antibiotics. By organism, Campylobacter jejuni and Enterococcus faecium phenotypes were more robustly predicted than those of Escherichia coli, Staphylococcus aureus, Salmonella enterica, Neisseria gonorrhoeae, Klebsiella pneumoniae, Pseudomonas aeruginosa, Acinetobacter baumannii, Streptococcus pneumoniae and Mycobacterium tuberculosis. In addition, our study provides software recommendations for each species-antibiotic combination. It furthermore highlights the need for optimization for robust clinical applications, particularly for strains that diverge substantially from those used for training.

  • Article type: Journal Article
    Although protein sequences encode the information for folding and function, understanding their link is not an easy task. Unfortunately, the prediction of how specific amino acids contribute to these features is still considerably impaired. Here, we developed a simple algorithm that finds positions in a protein sequence with the potential to modulate the studied quantitative phenotypes. From a few hundred protein sequences, we perform multiple sequence alignments, obtain the per-position pairwise differences for both the sequences and the observed phenotypes, and calculate the correlation between these two quantities. We tested our methodology with four cases: archaeal adenylate kinases and the organisms' optimal growth temperatures, microbial rhodopsins and their maximal absorption wavelengths, mammalian myoglobins and their muscular concentrations, and inhibition of HIV protease clinical isolates by two different molecules. We found from 3 to 10 positions tightly associated with those phenotypes, depending on the studied case. We showed that these correlations appear using individual positions, but an improvement is achieved when the most correlated positions are jointly analyzed. Notably, we performed phenotype predictions using a simple linear model that links per-position divergences and differences in the observed phenotypes. Predictions are comparable to those of state-of-the-art methodologies, which, in most cases, are far more complex. All of the calculations are obtained at a very low information cost, since the only input needed is a multiple sequence alignment of protein sequences with their associated quantitative phenotypes. The diversity of the explored systems makes our work a valuable tool to find sequence determinants of biological activity modulation and to predict various functional features for uncharacterized members of a protein family.
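
The procedure described in this abstract (pairwise differences per alignment position, pairwise phenotype differences, then a correlation between the two) might be sketched as follows. This is only an illustration under stated assumptions: the 0/1 mismatch indicator as the per-position sequence difference, the absolute difference as the phenotype difference, Pearson correlation as the association measure, and all function names and toy sequences are hypothetical and are not confirmed by the source.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def position_scores(aligned_seqs, phenotypes):
    """Correlate per-position mismatches with phenotype differences.

    aligned_seqs are equal-length rows of a multiple sequence alignment.
    Assumption: a mismatch counts as 1.0 and a match as 0.0.
    """
    pairs = list(combinations(range(len(aligned_seqs)), 2))
    pheno_diff = [abs(phenotypes[i] - phenotypes[j]) for i, j in pairs]
    scores = []
    for pos in range(len(aligned_seqs[0])):
        seq_diff = [
            0.0 if aligned_seqs[i][pos] == aligned_seqs[j][pos] else 1.0
            for i, j in pairs
        ]
        scores.append(pearson(seq_diff, pheno_diff))
    return scores


# Toy alignment (invented): position 2 (D vs E) tracks the phenotype.
toy_alignment = ["ACDA", "ACEA", "GCDA", "GCEA"]
toy_phenotypes = [10.0, 20.0, 10.0, 20.0]

toy_scores = position_scores(toy_alignment, toy_phenotypes)
best_position = max(range(len(toy_scores)), key=lambda p: toy_scores[p])
print(best_position, toy_scores)
```

Positions with the highest scores would be the candidates for joint analysis; a substitution-matrix distance could replace the 0/1 indicator without changing the structure of the sketch.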

  • Article type: Journal Article
    Combining training data from multiple sources increases sample size and reduces confounding, leading to more accurate and less biased machine learning models. In healthcare, however, direct pooling of data is often not allowed by data custodians, who are accountable for minimizing the exposure of sensitive information. Federated learning offers a promising solution to this problem by training a model in a decentralized manner, thus reducing the risk of data leakage. Although federated learning is increasingly used on clinical data, its efficacy on individual-level genomic data has not been studied. This study lays the groundwork for the adoption of federated learning for genomic data by investigating its applicability in two scenarios: phenotype prediction on the UK Biobank data and ancestry prediction on the 1000 Genomes Project data. We show that federated models trained on data split into independent nodes achieve performance close to that of centralized models, even in the presence of significant inter-node heterogeneity. Additionally, we investigate how federated model accuracy is affected by communication frequency and suggest approaches to reduce computational complexity or communication costs.
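
To make the decentralized training idea concrete, here is a minimal sketch of federated averaging (FedAvg) on a one-feature linear model. The abstract does not say which federated algorithm was used, so FedAvg is swapped in as a common baseline; the function names, learning rate, and toy node data are assumptions made for illustration. The `local_steps` parameter controls how much work each node does between communication rounds, which is the communication-frequency trade-off mentioned above.

```python
def local_update(w, b, data, lr, local_steps):
    """Run a few gradient steps of least squares on one node's private data."""
    n = len(data)
    for _ in range(local_steps):
        grad_w = sum(2.0 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2.0 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def federated_average(nodes, rounds, local_steps, lr=0.05):
    """Average locally updated parameters, weighted by node sample size.

    Only parameters leave a node; the raw (x, y) records never do.
    """
    w, b = 0.0, 0.0
    total = sum(len(data) for data in nodes)
    for _ in range(rounds):
        updates = [local_update(w, b, data, lr, local_steps) for data in nodes]
        w = sum(len(data) * uw for data, (uw, _) in zip(nodes, updates)) / total
        b = sum(len(data) * ub for data, (_, ub) in zip(nodes, updates)) / total
    return w, b


# Toy nodes (invented) with different x ranges, all following y = 2x + 1.
node_a = [(x, 2.0 * x + 1.0) for x in (0.0, 0.5, 1.0)]
node_b = [(x, 2.0 * x + 1.0) for x in (1.5, 2.0, 2.5, 3.0)]

fed_w, fed_b = federated_average([node_a, node_b], rounds=200, local_steps=5)
print(fed_w, fed_b)
```

Raising `local_steps` while lowering `rounds` reduces communication at the cost of more drift between nodes when their data distributions differ.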

  • Article type: Journal Article
    To explore a robust tool for advancing digital breeding practices through an artificial intelligence-driven phenotype prediction expert system, we undertook a thorough analysis of 11 non-linear regression models. Our investigation specifically emphasized the significance of Support Vector Regression (SVR) and SHapley Additive exPlanations (SHAP) in predicting soybean branching. Using branching data (phenotype) of 1918 soybean accessions and 42k SNP (single nucleotide polymorphism) data (genotype), this study systematically compared 11 non-linear regression AI models, including four deep learning models (DBN (deep belief network) regression, ANN (artificial neural network) regression, Autoencoder regression, and MLP (multilayer perceptron) regression) and seven machine learning models (SVR (support vector regression), XGBoost (eXtreme Gradient Boosting) regression, Random Forest regression, LightGBM regression, GP (Gaussian process) regression, Decision Tree regression, and Polynomial regression). After evaluation with four metrics, R² (R-squared), MAE (Mean Absolute Error), MSE (Mean Squared Error), and MAPE (Mean Absolute Percentage Error), SVR, Polynomial Regression, DBN, and Autoencoder were found to outperform the other models and to achieve better accuracy in phenotype prediction. In the assessment of deep learning approaches, we used the SVR model as an example, analyzing feature importance and gene ontology (GO) enrichment to provide comprehensive support. A comprehensive comparison of four feature importance algorithms (Variable Ranking, Permutation, SHAP, and Correlation Matrix) showed no notable distinction in their feature importance ranking scores, but because the SHAP value can provide rich information on genes with negative contributions, SHAP importance was chosen for feature selection. The results of this study offer valuable insights into AI-mediated plant breeding, addressing challenges faced by traditional breeding programs. The method developed has broad applicability in phenotype prediction, minor QTL (quantitative trait loci) mining, and plant smart-breeding systems, contributing significantly to the advancement of AI-based breeding practices and the transition from experience-based to data-based breeding.
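
The four evaluation metrics named in this abstract have standard definitions, which the sketch below computes directly. The function name and the toy values are illustrative assumptions; MAPE is reported as a percentage and assumes that no true value is zero.

```python
def regression_metrics(y_true, y_pred):
    """Return R2, MAE, MSE and MAPE (in percent) for paired observations."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    # MAPE is undefined when a true value is zero (assumed not to occur here).
    mape = 100.0 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n
    mean_true = sum(y_true) / n
    ss_total = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - (mse * n) / ss_total
    return {"R2": r2, "MAE": mae, "MSE": mse, "MAPE": mape}


# Toy values (invented), e.g. observed vs. predicted branch numbers.
observed = [2.0, 4.0, 5.0, 8.0]
predicted = [2.5, 3.5, 5.0, 7.0]

metrics = regression_metrics(observed, predicted)
print(metrics)
```

R² and MSE reward the same fits, since R² is a rescaling of the summed squared error, whereas MAE and MAPE are less sensitive to a few large errors.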

  • Article type: Letter
    In recent decades, preterm birth (PTB) has become a significant research focus in the healthcare field, as it is a leading cause of neonatal mortality worldwide. Using five independent study cohorts including 1290 vaginal samples from 561 pregnant women who delivered at term (n = 1029) or prematurely (n = 261), we analysed vaginal metagenomics data for precise microbiome structure characterization. Then, a deep neural network (DNN) was trained to predict term birth (TB) and PTB with an accuracy of 84.10% and an area under the receiver operating characteristic curve (AUROC) of 0.875 ± 0.11. During a benchmarking process, we demonstrated that our DL model outperformed seven currently used machine learning algorithms. Finally, our results indicate that the overall diversity of the vaginal microbiota, rather than specific species, should be taken into account when predicting PTB. This artificial intelligence-based strategy should be highly helpful for clinicians in predicting preterm birth risk, allowing personalized assistance to address various health issues. DeepMPTB is open source and free for academic use. It is licensed under a GNU Affero General Public License 3.0 and is available at https://deepmptb.streamlit.app/ . Source code is available at https://github.com/oschakoory/DeepMPTB and can be easily installed using Docker (https://www.docker.com/).
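
The two headline numbers in this abstract, accuracy and AUROC, can be computed as in the sketch below. It is not DeepMPTB code: the function names, the 0.5 decision threshold, and the toy labels and scores are assumptions made for illustration. AUROC is computed through its rank interpretation, the probability that a randomly chosen positive case scores higher than a randomly chosen negative case, with ties counted as one half.

```python
def auroc(labels, scores):
    """AUROC as the fraction of positive/negative pairs ranked correctly."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def accuracy(labels, scores, threshold=0.5):
    """Share of samples whose thresholded score matches the label."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    correct = sum(1 for label, pred in zip(labels, predictions) if label == pred)
    return correct / len(labels)


# Toy data (invented): 1 = preterm birth, 0 = term birth.
toy_labels = [1, 1, 0, 0, 1, 0]
toy_probabilities = [0.9, 0.6, 0.4, 0.7, 0.8, 0.2]

toy_auroc = auroc(toy_labels, toy_probabilities)
toy_accuracy = accuracy(toy_labels, toy_probabilities)
print(toy_auroc, toy_accuracy)
```

Unlike accuracy, AUROC does not depend on the chosen threshold, which matters for imbalanced cohorts such as the one described here (1029 term versus 261 preterm samples).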

  • Article type: Journal Article
    The recent development of deep learning methods has undoubtedly led to great improvements in various machine learning tasks, especially in prediction tasks. These methods have also been adapted to answer various problems in bioinformatics, including automatic genome annotation, artificial genome generation and phenotype prediction. In particular, a specific type of deep learning method, called a graph neural network (GNN), has repeatedly been reported as a good candidate to predict phenotypes from gene expression because of its ability to embed information on gene regulation or co-expression through the use of a gene network. However, to date, no complete and reproducible benchmark has been performed to analyze the trade-off between the cost and benefit of this approach compared to more standard (and simpler) machine learning methods. In this article, we provide such a benchmark, based on clear and comparable policies to evaluate the different methods on several datasets. Our conclusion is that GNNs rarely provide a real improvement in prediction performance, especially when weighed against the computational effort they require. Our findings on a limited but controlled simulated dataset show that this could be explained by the limited quality or predictive power of the input biological gene network itself.

  • Article type: Journal Article
    Forensic entomological evidence is employed to estimate the minimum postmortem interval (PMImin) and location, and to identify fly samples or human remains. Traditional forensic DNA analysis (i.e., STR, mitochondrial DNA) has been used for human identification from larval gut contents. Forensic DNA phenotyping (FDP), which predicts human appearance from DNA-based crime scene evidence, has become an established approach in forensic genetics in recent years. In this study, we aimed to recover human DNA from Lucilia sericata (Meigen 1826) (Diptera: Calliphoridae) gut contents and predict the eye and hair color of individuals using the HIrisPlex system. Lucilia sericata larvae and reference blood samples were collected from 30 human volunteers who were undergoing maggot debridement therapy. The human DNA was extracted from the crop contents and quantified. HIrisPlex multiplex analysis was performed using the SNaPshot minisequencing procedure. The HIrisPlex online tool was used to assess the prediction of the eye and hair color of the larval and reference samples. We successfully genotyped 25 out of 30 larval samples, and most SNP genotypes (87.13%) matched those of the reference samples, though some alleles dropped out, producing partial profiles. The prediction of eye color was accurate in 17 out of 25 larval samples, and only one sample was misclassified. Fourteen out of 25 larval samples were correctly predicted for hair color, and eight were misclassified. This study shows that SNP analysis of L. sericata gut contents can be used to predict the eye and hair color of a corpse.

  • Article type: Preprint
    Precision medicine models often perform better for populations of European ancestry due to the over-representation of this group in the genomic datasets and large-scale biobanks from which the models are constructed. As a result, prediction models may misrepresent or provide less accurate treatment recommendations for underrepresented populations, contributing to health disparities. This study introduces an adaptable machine learning toolkit that integrates multiple existing methodologies and novel techniques to enhance the prediction accuracy for underrepresented populations in genomic datasets. By leveraging machine learning techniques, including gradient boosting and automated methods, coupled with novel population-conditional re-sampling techniques, our method significantly improves the phenotypic prediction from single nucleotide polymorphism (SNP) data for diverse populations. We evaluate our approach using the UK Biobank, which is composed primarily of British individuals with European ancestry, and a minority representation of groups with Asian and African ancestry. Performance metrics demonstrate substantial improvements in phenotype prediction for underrepresented groups, achieving prediction accuracy comparable to that of the majority group. This approach represents a significant step towards improving prediction accuracy amidst current dataset diversity challenges. By integrating a tailored pipeline, our approach fosters more equitable validity and utility of statistical genetics methods, paving the way for more inclusive models and outcomes.