Cheminformatics

  • Article type: Journal Article
    Marine natural products (MNPs) continue to be tested primarily in cellular toxicity assays, both mammalian and microbial, despite most being inactive at concentrations relevant to drug discovery. These MNPs become missed opportunities and represent a wasteful use of precious bioresources. The use of cheminformatics aligned with published bioactivity data can provide insights to direct the choice of bioassays for the evaluation of new MNPs. Cheminformatics analysis of MNPs found in MarinLit (n = 39,730) up to the end of 2023 highlighted indol-3-yl-glyoxylamides (IGAs, n = 24) as a group of MNPs with no reported bioactivities. However, a recent review of synthetic IGAs highlighted these scaffolds as privileged structures with several compounds under clinical evaluation. Herein, we report the synthesis of a library of 32 MNP-inspired brominated IGAs (25-56) using a simple one-pot, multistep method affording access to these diverse chemical scaffolds. Directed by a meta-analysis of the biological activities reported for marine indole alkaloids (MIAs) and synthetic IGAs, the brominated IGAs 25-56 were examined for their potential bioactivities against the Parkinson's disease amyloid protein alpha-synuclein (α-syn), antiplasmodial activities against chloroquine-resistant (3D7) and sensitive (Dd2) parasite strains of Plasmodium falciparum, and inhibition of mammalian (chymotrypsin and elastase) and viral (SARS-CoV-2 3CLpro) proteases. All of the synthetic IGAs tested exhibited binding affinity to the amyloid protein α-syn, while some showed inhibitory activities against P. falciparum and the proteases SARS-CoV-2 3CLpro and chymotrypsin. The cellular safety of the IGAs was examined against cancerous and non-cancerous human cell lines, with all of the compounds tested inactive, thereby validating the cheminformatics and meta-analysis results. The findings presented herein expand our knowledge of marine IGA bioactive chemical space and advocate expanding the scope of biological assays routinely used to investigate NP bioactivities, specifically those more suitable for non-toxic compounds. By integrating cheminformatics tools and functional assays into NP biological testing workflows, we can aim to enhance the potential of NPs and their scaffolds for future drug discovery and development.

  • Article type: Journal Article
    Indoleamine 2,3-dioxygenase (IDO) and tryptophan 2,3-dioxygenase (TDO) are attractive drug targets for cancer immunotherapy. After the disappointing results of epacadostat, a selective IDO inhibitor, in phase III clinical trials, there is much interest in the development of TDO-selective inhibitors. In the current study, several data analysis methods and machine learning approaches, including logistic regression, Random Forest, XGBoost and Support Vector Machines, were used to model a data set of compounds retrieved from ChEMBL. Models based on Morgan fingerprints revealed notable fragments for the selective inhibition of IDO, TDO or both. Multiple fragment docking was performed to find the best set of bound fragments and their orientation in space for efficient linking. Linking of the fragments and optimization of the final molecules were accomplished by means of an artificial intelligence generative framework. Finally, the selectivity of the optimized molecules was assessed and the top 4 lead molecules were filtered through PAINS, Brenk and NIH filters. Results indicated that phenyloxalamide, fluoroquinoline, and 3-bromo-4-fluoroaniline confer selectivity towards IDO inhibition. Correspondingly, 1-benzyl-1H-naphtho[2,3-d][1,2,3]triazole-4,9-dione was found to be an integral fragment for the selective inhibition of TDO by forming a coordination bond with the Fe atom of heme. In addition, furo[2,3-c]pyridine-2,3-diamine was found to be a common fragment for inhibition of both targets and can be used in the design of dual-target inhibitors of IDO and TDO. The new fragments introduced here can be useful building blocks for incorporation into selective TDO or dual IDO/TDO inhibitors.

  • Article type: Journal Article
    Cannabidiol has been reported to interact with a broad spectrum of biological targets with pleiotropic pharmacology, including in epilepsy, although a cohesive mechanism is yet to be determined. Even though some studies propose that cannabidiol may manipulate glutamatergic signals, there is insufficient evidence to support a direct effect of cannabidiol on glutamate signaling, which is important for intervening in epilepsy. Therefore, the present study aimed to analyze the epilepsy-related targets of cannabidiol, assess the differentially expressed genes with its treatment, and identify the possible glutamatergic signaling target. In this study, the epileptic protein targets of cannabidiol were identified using Tanimoto coefficient and similarity index-based target fishing, and were later overlapped with proteins of altered expression, epileptic biomarkers, and genetically altered proteins in epilepsy. The common proteins were then screened for possible glutamatergic signaling targets with differentially expressed genes. Later, molecular docking and simulation were performed using AutoDock Vina and GROMACS to evaluate binding affinity, ligand-protein stability, hydrophilic interaction, protein compactness, etc. Thirty different epilepsy-related targets of cannabidiol were identified across multiple protein classes, including G-protein coupled receptors, enzymes, ion channels, etc. Glutamate receptor 2, which was targeted by cannabidiol, was identified as genetically varied in epilepsy, and its expression was increased with cannabidiol treatment. More importantly, cannabidiol showed a direct binding affinity with glutamate receptor 2, forming a stable hydrophilic interaction with comparatively lower root mean squared deviation and residual fluctuations, and increasing protein compactness with broad conformational changes. Based on the cheminformatic target fishing, evaluation of differentially expressed genes, molecular docking, and simulations, it can be hypothesized that cannabidiol may possess glutamate receptor 2-mediated anti-epileptic activities.
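The Tanimoto-based target fishing mentioned in this abstract can be illustrated with a minimal sketch. It assumes fingerprints are represented as sets of on-bit indices; the function names, the 0.5 threshold, and the toy fingerprints are all hypothetical and are not taken from the study.

```python
from typing import Dict, List, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def fish_targets(
    query: Set[int],
    known_ligands: Dict[str, List[Set[int]]],
    threshold: float = 0.5,  # hypothetical cut-off, not from the abstract
) -> List[Tuple[str, float]]:
    """Rank targets by the best similarity between the query and any known ligand."""
    hits = []
    for target, ligands in known_ligands.items():
        best = max((tanimoto(query, fp) for fp in ligands), default=0.0)
        if best >= threshold:
            hits.append((target, best))
    # Most similar target first.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical toy fingerprints; real ones would come from a cheminformatics toolkit.
query_fp = {1, 2, 3, 4}
ligand_fps = {
    "target_A": [{1, 2, 3, 5}, {7, 8}],
    "target_B": [{4, 9, 10, 11}],
}
print(fish_targets(query_fp, ligand_fps))
```

In this toy case only `target_A` passes the threshold, because one of its ligands shares three of five distinct bits with the query.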

  • Article type: Journal Article
    Chemical information disseminated in scientific documents offers an untapped potential for deep learning-assisted insights and breakthroughs. Automated extraction efforts have shifted from resource-intensive manual extraction toward applying machine learning methods to streamline chemical data extraction. While current extraction models and pipelines have ushered in notable efficiency improvements, they often exhibit modest performance, compromising the accuracy of predictive models trained on extracted data. Further, current chemical pipelines lack both transferability (where a model trained on one task can be adapted to another relevant task with limited examples) and extensibility, which enables seamless adaptability for new extraction tasks. Addressing these gaps, we present ChemREL, a versatile chemical data extraction pipeline emphasizing performance, transferability, and extensibility. ChemREL utilizes a custom, diverse data set of chemical documents, labeled through an active learning strategy to extract two properties: normal melting point and lethal dose 50 (LD50). The normal melting point is selected for its prevalence in diverse contexts and wider literature, serving as the foundation for pipeline training. In contrast, LD50 evaluates the pipeline's transferability to an unrelated property, underscoring variance in its biological nature, toxicological context, and units, among other differences. With pretraining and fine-tuning, our pipeline outperforms existing methods and GPT-4, achieving F1-scores of 96.1% for entity identification and 97.0% for relation mapping, culminating in an overall F1-score of 95.4%. More importantly, ChemREL displays high transferability, effectively transitioning from melting point extraction to LD50 extraction with 10 randomly selected training documents. Released as an open-source package, ChemREL aims to broaden access to chemical data extraction, enabling the construction of expansive relational data sets that propel discovery.
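For readers unfamiliar with the F1-scores quoted above, the following minimal sketch shows how an F1-score could be computed for extracted (compound, property, value) relations. The abstract does not describe the evaluation procedure, so the exact-match criterion and the toy data here are assumptions for illustration only.

```python
from typing import Set, Tuple

Relation = Tuple[str, str, str]  # (compound, property, value with unit)


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


def evaluate(predicted: Set[Relation], gold: Set[Relation]) -> float:
    """Score predictions against gold annotations using exact matching (an assumption)."""
    tp = len(predicted & gold)   # relations extracted correctly
    fp = len(predicted - gold)   # relations extracted but not annotated
    fn = len(gold - predicted)   # annotated relations that were missed
    return f1_score(tp, fp, fn)


# Hypothetical toy annotations, not data from the paper.
gold = {
    ("compound A", "melting point", "120 °C"),
    ("compound B", "melting point", "85 °C"),
    ("compound C", "melting point", "210 °C"),
}
predicted = {
    ("compound A", "melting point", "120 °C"),
    ("compound B", "melting point", "85 °C"),
    ("compound C", "melting point", "201 °C"),  # wrong value counts as FP and FN
}
print(round(evaluate(predicted, gold), 3))
```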

  • Article type: Journal Article
    Circular dichroism (CD) based enantiomeric excess (ee) determination assays are optical alternatives to chromatographic ee determination in high-throughput screening (HTS) applications. However, the implementation of these assays requires calibration experiments using enantioenriched materials. We present a data-driven approach that circumvents the need for chiral resolution and calibration experiments for an octahedral Fe(II) complex (1) used for the ee determination of α-chiral primary amines. By computationally parameterizing the imine ligands formed in the assay conditions, a model of the circular dichroism (CD) response of the Fe(II) assembly was developed. Using this model, calibration curves were generated for four analytes and compared to experimentally generated curves. In a single-blind ee determination study, the ee values of unknown samples were determined within 9% mean absolute error, which rivals the error using experimentally generated calibration curves.
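The role of a calibration curve in ee determination, and the mean absolute error used to judge it, can be sketched as follows. This assumes a simple linear relationship between CD signal and ee, fitted by ordinary least squares; the abstract does not state the functional form used by the authors, and all numbers below are made up for illustration.

```python
from typing import List, Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_ee(cd_signal: float, slope: float, intercept: float) -> float:
    """Invert the calibration line to estimate ee (%) from a CD reading."""
    return (cd_signal - intercept) / slope


def mean_absolute_error(predicted: Sequence[float], actual: Sequence[float]) -> float:
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical calibration points: ee (%) versus CD signal (arbitrary units).
ee_known = [-100.0, -50.0, 0.0, 50.0, 100.0]
cd_known = [-20.0, -10.0, 0.0, 10.0, 20.0]
slope, intercept = fit_line(ee_known, cd_known)

# Hypothetical "unknown" samples with their true ee values for scoring.
cd_unknown = [5.0, -14.0]
ee_true = [28.0, -65.0]
ee_pred: List[float] = [predict_ee(cd, slope, intercept) for cd in cd_unknown]
print(ee_pred, mean_absolute_error(ee_pred, ee_true))
```

In the approach described above, the calibration points would come from a computational model rather than from measurements on enantioenriched samples; the inversion and error calculation would be unchanged.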

  • Article type: Journal Article
    Compound databases of natural products play a crucial role in drug discovery and development projects and have implications in other areas, such as food chemical research, ecology and metabolomics. Recently, we put together the first version of the Latin American Natural Product database (LANaPDB) as a collective effort of researchers from six countries to assemble a public and representative library of natural products in a geographical region with a large biodiversity. The present work aims to conduct a comparative and extensive profiling of the natural product-likeness of an updated version of LANaPDB and the ten individual compound databases that form part of LANaPDB. The natural product-likeness profile of the Latin American compound databases is contrasted with the profile of other major natural product databases in the public domain and a set of small-molecule drugs approved for clinical use. As part of the extensive characterization, we employed several chemoinformatics metrics of natural product-likeness. The results of this study will capture the attention of the global community engaged in natural product databases, not only in Latin America but across the world.

  • Article type: Journal Article
    The exploration of chemical space is a fundamental aspect of chemoinformatics, particularly when one explores a large compound data set to relate chemical structures with molecular properties. In this study, we extend our previous work on chemical space visualization at the pharmacophoric level. Instead of using the conventional binary classification of affinity (active vs inactive), we introduce a refined approach that categorizes compounds into four distinct classes based on their activity levels: super active, very active, active, and inactive. This classification enriches the color scheme applied to pharmacophore space, where the color representation of a pharmacophore hypothesis is driven by the associated compounds. Using the BCR-ABL tyrosine kinase as a case study, we identified intriguing regions corresponding to pharmacophore activity discontinuities, providing valuable insights for structure-activity relationship analysis.
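The four-class scheme described above can be sketched as a simple binning step followed by a per-pharmacophore aggregation. The abstract gives neither the activity cut-offs nor the aggregation rule, so the pIC50 thresholds and the majority-vote rule below are purely hypothetical placeholders.

```python
from collections import Counter
from typing import Sequence, Tuple

# Ordered from most to least active.
CLASSES = ("super active", "very active", "active", "inactive")


def activity_class(
    pic50: float,
    cutoffs: Tuple[float, float, float] = (9.0, 8.0, 6.0),  # hypothetical thresholds
) -> str:
    """Bin a compound into one of four activity classes by its pIC50."""
    for label, cutoff in zip(CLASSES, cutoffs):
        if pic50 >= cutoff:
            return label
    return CLASSES[-1]


def pharmacophore_class(pic50_values: Sequence[float]) -> str:
    """Assign a pharmacophore the most common class among its associated compounds.

    Majority vote is an assumption; ties go to the more active class because
    max() returns the first maximal item and CLASSES is ordered by activity.
    """
    if not pic50_values:
        raise ValueError("a pharmacophore needs at least one associated compound")
    counts = Counter(activity_class(v) for v in pic50_values)
    return max(CLASSES, key=lambda label: counts[label])


# Hypothetical pIC50 values for compounds matching one pharmacophore hypothesis.
print(pharmacophore_class([9.2, 8.5, 8.1, 5.0]))
```

The returned label could then be mapped to a color when plotting pharmacophore space.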

  • Article type: Journal Article
    Chemical information has become increasingly ubiquitous and has outstripped the pace of analysis and interpretation. We have developed an R package, uafR, that automates a grueling retrieval process for gas chromatography coupled mass spectrometry (GC-MS) data and allows anyone interested in chemical comparisons to quickly perform advanced structural similarity matches. Our streamlined cheminformatics workflows allow anyone with basic experience in R to pull out component areas for tentative compound identifications using the best published understanding of molecules across samples (pubchem.gov). Interpretations can now be done at a fraction of the time, cost, and effort it would typically take using a standard chemical ecology data analysis pipeline. The package was tested in two experimental contexts: (1) a dataset of purified internal standards, which showed our algorithms correctly identified the known compounds with R² values ranging from 0.827 to 0.999 along concentrations ranging from 1 × 10⁻⁵ to 1 × 10³ ng/μl; and (2) a large, previously published dataset, where the number and types of compounds identified were comparable (or identical) to those identified with the traditional manual peak annotation process, and NMDS analysis of the compounds produced the same pattern of significance as in the original study. Both the speed and accuracy of GC-MS data processing are drastically improved with uafR because it allows users to fluidly interact with their experiment following tentative library identifications [i.e. after the m/z spectra have been matched against an installed chemical fragmentation database (e.g. NIST)]. Use of uafR will allow larger datasets to be collected and systematically interpreted quickly. Furthermore, the functions of uafR could allow backlogs of previously collected and annotated data to be processed by new personnel or students as they are being trained. This is critical as we enter the era of exposomics, metabolomics, volatilomes, and landscape-level, high-throughput chemotyping. This package was developed to advance collective understanding of chemical data and is applicable to any research that benefits from GC-MS analysis. It can be downloaded for free along with sample datasets from Github at github.org/castratton/uafR or installed directly from R or RStudio using the developer tools: 'devtools::install_github("castratton/uafR")'.

  • Article type: Journal Article
    One of the most challenging tasks in modern medicine is to find novel efficient cancer therapeutic methods with minimal side effects. The recent discovery of several classes of organic molecules known as "molecular jackhammers" is a promising development in this direction. It is known that these molecules can directly target and eliminate cancer cells with no impact on healthy tissues. However, the underlying microscopic picture remains poorly understood. We present a study that utilizes theoretical analysis together with experimental measurements to clarify the microscopic aspects of jackhammers' anticancer activities. Our physical-chemical approach combines statistical analysis with chemoinformatics methods to design and optimize molecular jackhammers. By correlating specific physical-chemical properties of these molecules with their abilities to kill cancer cells, several important structural features are identified and discussed. Although our theoretical analysis enhances understanding of the molecular interactions of jackhammers, it also highlights the need for further research to comprehensively elucidate their mechanisms and to develop a robust physical-chemical framework for the rational design of targeted anticancer drugs.

  • Article type: Journal Article
    Information extraction from chemistry literature is vital for constructing up-to-date reaction databases for data-driven chemistry. Complete extraction requires combining information across text, tables, and figures, whereas prior work has mainly investigated extracting reactions from single modalities. In this paper, we present OpenChemIE to address this complex challenge and enable the extraction of reaction data at the document level. OpenChemIE approaches the problem in two steps: extracting relevant information from individual modalities and then integrating the results to obtain a final list of reactions. For the first step, we employ specialized neural models that each address a specific task for chemistry information extraction, such as parsing molecules or reactions from text or figures. We then integrate the information from these modules using chemistry-informed algorithms, allowing for the extraction of fine-grained reaction data from reaction condition and substrate scope investigations. Our machine learning models attain state-of-the-art performance when evaluated individually, and we meticulously annotate a challenging dataset of reaction schemes with R-groups to evaluate our pipeline as a whole, achieving an F1 score of 69.5%. Additionally, the reaction extraction results of OpenChemIE attain an accuracy score of 64.3% when directly compared against the Reaxys chemical database. OpenChemIE is most suited for information extraction on organic chemistry literature, where molecules are generally depicted as planar graphs or written in text and can be consolidated into a SMILES format. We provide OpenChemIE freely to the public as an open-source package, as well as through a web interface.