Keywords: deep learning; drug response prediction; drug sensitivity; precision oncology

MeSH: Algorithms; Cell Line; Humans; Machine Learning; Neoplasms / drug therapy, genetics; Neural Networks, Computer

Source: DOI:10.1093/bib/bbab356

Abstract:
To enable personalized cancer treatment, machine learning models have been developed to predict drug response as a function of tumor and drug features. However, most algorithm development efforts have relied on cross-validation within a single study to assess model accuracy. While an essential first step, cross-validation within a biological data set typically provides an overly optimistic estimate of the prediction performance on independent test sets. To provide a more rigorous assessment of model generalizability between different studies, we use machine learning to analyze five publicly available cell line-based data sets: National Cancer Institute 60, Cancer Therapeutics Response Portal (CTRP), Genomics of Drug Sensitivity in Cancer, Cancer Cell Line Encyclopedia and Genentech Cell Line Screening Initiative (gCSI). Based on observed experimental variability across studies, we explore estimates of prediction upper bounds. We report performance results of a variety of machine learning models, with a multitasking deep neural network achieving the best cross-study generalizability. By multiple measures, models trained on CTRP yield the most accurate predictions on the remaining testing data, and gCSI is the most predictable among the cell line data sets included in this study. With these experiments and further simulations on partial data, two lessons emerge: (1) differences in viability assays can limit model generalizability across studies and (2) drug diversity, more than tumor diversity, is crucial for raising model generalizability in preclinical screening.
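The cross-study evaluation described in the abstract can be illustrated with a minimal sketch: train a model on each study and score it on every study, giving a source-by-target matrix. Everything below is an assumption made for illustration, not the authors' method: the data are synthetic, the study names (`study_A`, `study_B`, `study_C`) and the helpers `make_study`, `fit_ridge` and `cross_study_matrix` are hypothetical, a constant `assay_shift` is a crude stand-in for viability-assay differences, and a plain ridge regression replaces the multitasking deep neural network.

```python
import numpy as np


def make_study(rng, n_samples, weights, assay_shift, noise):
    """Simulate one study: features X and a response y that shares the same
    underlying signal across studies but has a study-specific offset
    (hypothetical stand-in for assay differences) and noise level."""
    X = rng.normal(size=(n_samples, weights.size))
    y = X @ weights + assay_shift + rng.normal(scale=noise, size=n_samples)
    return X, y


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    penalty = alpha * np.eye(Xb.shape[1])
    penalty[-1, -1] = 0.0  # leave the intercept term unregularized
    return np.linalg.solve(Xb.T @ Xb + penalty, Xb.T @ y)


def predict(coef, X):
    return X @ coef[:-1] + coef[-1]


def r2(y_true, y_pred):
    """Coefficient of determination; can be negative on a shifted test set."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def cross_study_matrix(studies):
    """scores[i, j] = R^2 of the model trained on study i, tested on study j.
    Diagonal entries are in-sample fits (optimistic); off-diagonal entries
    are the cross-study generalization estimates."""
    names = list(studies)
    scores = np.zeros((len(names), len(names)))
    for i, source in enumerate(names):
        coef = fit_ridge(*studies[source])
        for j, target in enumerate(names):
            X, y = studies[target]
            scores[i, j] = r2(y, predict(coef, X))
    return names, scores


rng = np.random.default_rng(0)
weights = rng.normal(size=8)
studies = {
    "study_A": make_study(rng, 300, weights, assay_shift=0.0, noise=0.3),
    "study_B": make_study(rng, 300, weights, assay_shift=1.0, noise=0.3),
    "study_C": make_study(rng, 300, weights, assay_shift=-0.5, noise=0.6),
}
names, scores = cross_study_matrix(studies)

print("rows = training study, columns = test study")
for name, row in zip(names, scores):
    print(name, np.round(row, 3))
```

In this toy setting the diagonal overstates how well a model transfers, which mirrors the abstract's point that within-study validation tends to be optimistic relative to testing on an independent study.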