Keywords: domain adaptation; phenotype prediction; semi-supervised; underrepresented population

Source: DOI:10.1101/2023.10.10.561715

Abstract:
The lack of diversity in genomic datasets, currently skewed towards individuals of European ancestry, presents a challenge in developing inclusive biomedical models. The scarcity of such data is particularly evident in labeled datasets that include genomic data linked to electronic health records. To address this gap, this paper presents PopGenAdapt, a genotype-to-phenotype prediction model which adopts semi-supervised domain adaptation (SSDA) techniques originally proposed for computer vision. PopGenAdapt is designed to leverage the substantial labeled data available from individuals of European ancestry, as well as the limited labeled and the larger amount of unlabeled data from currently underrepresented populations. The method is evaluated in underrepresented populations from Nigeria, Sri Lanka, and Hawaii for the prediction of several disease outcomes. The results suggest a significant improvement in the performance of genotype-to-phenotype models for these populations over state-of-the-art supervised learning methods, setting SSDA as a promising strategy for creating more inclusive machine learning models in biomedical research.
Our code is available at https://github.com/AI-sandbox/PopGenAdapt.
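To make the data setup in the abstract concrete, the following is a minimal sketch of one generic semi-supervised domain adaptation objective: a supervised loss on the labeled source (European-ancestry) and labeled target samples, plus an entropy-minimization term on the unlabeled target samples. This is an illustrative assumption, not necessarily the objective PopGenAdapt uses; the logistic model, the function names, the weight `lam`, and the toy genotypes are all hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, genotype):
    """Phenotype probability from a genotype vector of allele counts (0/1/2)."""
    return sigmoid(sum(w * g for w, g in zip(weights, genotype)))


def cross_entropy(p, y, eps=1e-12):
    """Binary cross-entropy between predicted probability p and label y."""
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))


def binary_entropy(p, eps=1e-12):
    """Entropy of a Bernoulli prediction; low when the model is confident."""
    return -(p * math.log(p + eps) + (1 - p) * math.log(1 - p + eps))


def ssda_loss(weights, source_labeled, target_labeled, target_unlabeled, lam=0.1):
    """Supervised loss on all labeled samples plus a weighted entropy term
    on unlabeled target samples (hypothetical objective, for illustration)."""
    labeled = source_labeled + target_labeled
    supervised = sum(
        cross_entropy(predict(weights, x), y) for x, y in labeled
    ) / len(labeled)
    unsupervised = sum(
        binary_entropy(predict(weights, x)) for x in target_unlabeled
    ) / len(target_unlabeled)
    return supervised + lam * unsupervised


# Toy data: many labeled source samples, few labeled target samples,
# and additional unlabeled target samples (all values are made up).
source_labeled = [([0, 1], 0), ([2, 1], 1), ([1, 2], 1), ([0, 0], 0)]
target_labeled = [([2, 0], 1)]
target_unlabeled = [[1, 1], [0, 2], [2, 2]]

loss = ssda_loss([0.5, 0.5], source_labeled, target_labeled, target_unlabeled)
print(f"combined loss: {loss:.4f}")
```

In this sketch the labeled sets are pooled before averaging, so the larger source set dominates the supervised term; weighting the two domains separately would be an equally reasonable choice.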