A data science roadmap for open science organizations engaged in early-stage drug discovery

Abstract:

The Structural Genomics Consortium is an international open science research organization focused on accelerating early-stage drug discovery, namely hit discovery and optimization. Like many others, we believe that artificial intelligence (AI) is poised to be a main accelerator in the field. The question is then how best to benefit from recent advances in AI, and how to generate, format and disseminate data to enable future breakthroughs in AI-guided drug discovery. We present here the recommendations of a working group composed of experts from both the public and private sectors. Robust data management requires precise ontologies and standardized vocabulary, while a centralized database architecture across laboratories facilitates data integration into high-value datasets. Lab automation and opening electronic lab notebooks to data mining push the boundaries of data sharing and data modeling. Important considerations for building robust machine-learning models include transparent and reproducible data processing, choosing the most relevant data representation, defining the right training and test sets, and estimating prediction uncertainty. Beyond data sharing, cloud-based computing can be harnessed to build and disseminate machine-learning models. Important vectors of acceleration for hit and chemical probe discovery will be (1) the real-time integration of experimental data generation and modeling workflows within design-make-test-analyze (DMTA) cycles, openly and at scale, and (2) the adoption of a mindset where data scientists and experimentalists work as a unified team and where data science is incorporated into the experimental design.
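
Two of the modeling considerations named above, defining training and test sets and estimating prediction uncertainty, can be illustrated with a minimal sketch. It is not the working group's method. It assumes toy records with hypothetical fields `series` (standing in for a chemical series), `x` (a single descriptor) and `y` (a measured value), keeps whole series out of the training set, and uses the spread of a small bootstrap ensemble as a rough uncertainty estimate.

```python
import random
import statistics
from collections import defaultdict


def group_split(records, group_key, test_fraction=0.2, seed=0):
    """Split records so that no group appears in both train and test."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[group_key]].append(rec)
    group_ids = sorted(groups)
    random.Random(seed).shuffle(group_ids)
    target = test_fraction * len(records)
    train, test = [], []
    for gid in group_ids:
        # Fill the test set with whole groups until it reaches the target size.
        bucket = test if len(test) < target else train
        bucket.extend(groups[gid])
    return train, test


def fit_slope(rows):
    """Least-squares slope through the origin, y ~ w * x (toy model)."""
    sxx = sum(r["x"] ** 2 for r in rows)
    sxy = sum(r["x"] * r["y"] for r in rows)
    w = sxy / sxx
    return lambda x: w * x


def bootstrap_ensemble(rows, n_models=10, seed=0):
    """Fit one toy model per bootstrap resample of the training rows."""
    rng = random.Random(seed)
    return [fit_slope([rng.choice(rows) for _ in rows]) for _ in range(n_models)]


def ensemble_predict(models, x):
    """Return the mean prediction and its standard deviation across models."""
    preds = [m(x) for m in models]
    return statistics.mean(preds), statistics.pstdev(preds)


# Hypothetical toy data: 5 series, 4 compounds each, y roughly equal to 2 * x.
records = [
    {"series": f"S{i % 5}", "x": float(i + 1), "y": 2.0 * (i + 1) + (i % 3 - 1) * 0.5}
    for i in range(20)
]
train, test = group_split(records, "series")
models = bootstrap_ensemble(train)
mean, spread = ensemble_predict(models, 3.0)
print(f"prediction at x=3.0: {mean:.2f} +/- {spread:.2f}")
```

Splitting by series rather than at random is one way to keep close analogs of test compounds out of the training set. The ensemble spread is only a crude proxy for uncertainty, and a real workflow would use chemically meaningful groupings and calibrated estimates.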
