Chemical space

  • Article type: Journal Article
    In an effort to expand the domain of mathematical chemistry and inspire research beyond the realms of graph theory and quantum chemistry, we explore five mathematical chemistry spaces and their interconnectedness. These spaces comprise the chemical space, which encompasses substances and reactions; the space of reaction conditions, spanning the physical and chemical aspects involved in chemical reactions; the space of reaction grammars, which encapsulates the rules for creating and breaking chemical bonds; the space of substance properties, covering all documented measurements regarding substances; and the space of substance representations, composed of the various ontologies for characterising substances.

  • Article type: Journal Article
    The widespread use of Machine Learning (ML) techniques in chemical applications has come with the pressing need to analyze extremely large molecular libraries. In particular, clustering remains one of the most common tools to dissect chemical space. Unfortunately, most current approaches present unfavorable time and memory scaling, which makes them unsuitable for handling million- and billion-sized sets. Here, we propose to bypass these problems with a time- and memory-efficient clustering algorithm, BitBIRCH. This method uses a tree structure similar to the one found in the Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) algorithm to ensure O(N) time scaling. BitBIRCH leverages the instant similarity (iSIM) formalism to process binary fingerprints, allowing the use of Tanimoto similarity and reducing memory requirements. Our tests show that BitBIRCH is already more than 1,000 times faster than standard implementations of Taylor-Butina clustering for libraries with 1,500,000 molecules. BitBIRCH increases efficiency without compromising the quality of the resulting clusters. We explore strategies to handle large sets, which we applied in the clustering of one billion molecules in under 5 hours using a parallel/iterative BitBIRCH approximation.
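    As a point of reference for the abstract above, the Tanimoto similarity of two binary fingerprints is the number of shared on-bits divided by the number of bits set in either fingerprint. The sketch below is only an illustration of that pairwise measure, not the BitBIRCH or iSIM implementation; the function name `tanimoto`, the use of Python integers as bit vectors, and the convention of returning 1.0 for two empty fingerprints are all assumptions made here.

```python
def tanimoto(fp_a: int, fp_b: int) -> float:
    """Tanimoto similarity of two binary fingerprints stored as Python ints.

    Each set bit of the integer marks a feature that is present.
    """
    shared = bin(fp_a & fp_b).count("1")   # bits on in both fingerprints
    either = bin(fp_a | fp_b).count("1")   # bits on in at least one fingerprint
    if either == 0:
        # Assumed convention: two empty fingerprints count as identical.
        return 1.0
    return shared / either


# Hypothetical 8-bit fingerprints, chosen only for illustration.
fp_1 = 0b10110010
fp_2 = 0b10010110
similarity = tanimoto(fp_1, fp_2)  # 3 shared bits / 5 bits in the union
```

    Computing this for every pair in a library takes quadratic time in the number of molecules, which is the scaling problem the abstract says BitBIRCH is designed to avoid.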

  • Article type: Journal Article
    Protein synthesis methods have been adapted to incorporate an ever-growing level of non-natural components. Meanwhile, design of de novo protein structure and function has rapidly emerged as a viable capability. Yet, these two exciting trends have yet to intersect in a meaningful way. The ability to perform de novo design with non-proteinogenic components requires that synthesis and computation align on common targets and applications. This perspective examines the state of the art in these areas and identifies specific, consequential applications to advance the field toward generalized macromolecule design.

  • Article type: Journal Article
    For the past two decades, virtual screening (VS) has been an efficient hit-finding approach for drug discovery. Today, billions of commercially accessible compounds are routinely screened, and many successful examples of VS have been reported. VS methods continue to evolve, including machine learning and physics-based methods.
    The authors examine recent examples of VS in drug discovery and discuss prospective hit-finding results from the critical assessment of computational hit-finding experiments (CACHE) challenge. The authors also highlight the cost considerations and open-source options for conducting VS and examine chemical space coverage and library selections for VS.
    Advances in sophisticated VS approaches, including the use of machine learning techniques and increased computer resources, together with easy access to synthetically available chemical spaces and to commercial and open-source VS platforms, allow ultra-large libraries (ULL) of billions of molecules to be interrogated. An impressive number of prospective ULL VS campaigns have generated potent and structurally novel hits across many target classes. Nonetheless, many successful contemporary VS approaches still use considerably smaller focused libraries. This apparent dichotomy illustrates that VS is best conducted in a fit-for-purpose way, choosing an appropriate chemical space. Better methods need to be developed to tackle more challenging targets.

  • Article type: Journal Article
    Data scarcity is one of the most critical issues impeding the development of prediction models for chemical effects. Multitask learning algorithms that leverage knowledge from relevant tasks have shown potential for dealing with tasks with limited data. However, current multitask methods mainly focus on learning from datasets whose task labels are available for most of the training samples. Since datasets are generated for different purposes and cover distinct chemical spaces, conventional multitask learning methods may not be suitable. This study presents a novel multitask learning method, MTForestNet, that can deal with data scarcity problems and learn from tasks with distinct chemical spaces. MTForestNet consists of nodes of random forest classifiers organized in the form of a progressive network, where each node represents a random forest model learned from a specific task. To demonstrate the effectiveness of MTForestNet, 48 zebrafish toxicity datasets were collected and used as an example. Among them, two tasks are very different from the others, sharing only 1.3% of their chemicals with other tasks. In an independent test, MTForestNet achieved a high area under the receiver operating characteristic curve (AUC) of 0.911, outperforming the single-task and multitask methods it was compared with. The overall toxicity derived from the developed zebrafish toxicity models is well correlated with the experimentally determined overall toxicity. In addition, the outputs of the developed zebrafish toxicity models can be used as features to boost the prediction of developmental toxicity. The developed models are effective for predicting zebrafish toxicity, and the proposed MTForestNet is expected to be useful for other tasks with distinct chemical spaces.
    Scientific contribution: A novel multitask learning algorithm, MTForestNet, was proposed to address the challenge of developing models from datasets with distinct chemical spaces, which is a common issue in cheminformatics tasks. As an example, zebrafish toxicity prediction models were developed using the proposed MTForestNet, which provides superior performance over conventional single-task and multitask learning methods. In addition, the developed zebrafish toxicity prediction models can reduce animal testing.

  • Article type: Journal Article
    Helicobacter pylori is the main causative agent of gastric cancer, especially non-cardiac gastric cancers. This bacterium relies on urease, which produces large amounts of ammonia, to colonize the host. Herein, the study provides valuable insights into the structural patterns driving urease inhibition, for high-activity molecules designed by exploring known inhibitors. Firstly, an ensemble model that combines four machine learning approaches was devised to predict the inhibitory activity of novel compounds in an automated workflow (R² = 0.761). The dataset was characterized in terms of chemical space, including molecular scaffolds, clustering analysis, distribution of physicochemical properties, and activity cliffs. Through these analyses, the hydroxamic acid group and the benzene ring responsible for distinct activity were highlighted. Activity cliff pairs revealed that substituents of the benzene ring on hydroxamic acid derivatives are key structures for substantial activity enhancement. Moreover, 11 hydroxamic acid derivatives were designed, named mol1-11. Results of molecular dynamics simulations showed that mol9 stabilized the closed conformation of the active-site flap and is expected to be a promising drug candidate for Helicobacter pylori infection, to be evaluated in further in vitro, in vivo, and clinical trials in the future.

  • Article type: Journal Article
    Compound databases of natural products play a crucial role in drug discovery and development projects and have implications in other areas, such as food chemical research, ecology and metabolomics. Recently, we put together the first version of the Latin American Natural Product database (LANaPDB) as a collective effort of researchers from six countries to assemble a public and representative library of natural products in a geographical region with a large biodiversity. The present work aims to conduct a comparative and extensive profiling of the natural product-likeness of an updated version of LANaPDB and of the ten individual compound databases that form part of LANaPDB. The natural product-likeness profile of the Latin American compound databases is contrasted with the profile of other major natural product databases in the public domain and a set of small-molecule drugs approved for clinical use. As part of the extensive characterization, we employed several chemoinformatics metrics of natural product-likeness. The results of this study will capture the attention of the global community engaged in natural product databases, not only in Latin America but across the world.

  • Article type: Journal Article
    In the modern "omics" era, measurement of the human exposome is a critical missing link between genetic drivers and disease outcomes. High-resolution mass spectrometry (HRMS), routinely used in proteomics and metabolomics, has emerged as a leading technology to broadly profile chemical exposure agents and related biomolecules for accurate mass measurement, high sensitivity, rapid data acquisition, and increased resolution of chemical space. Non-targeted approaches are increasingly accessible, supporting a shift from conventional hypothesis-driven, quantitation-centric targeted analyses toward data-driven, hypothesis-generating chemical exposome-wide profiling. However, HRMS-based exposomics encounters unique challenges. New analytical and computational infrastructures are needed to expand the analysis coverage through streamlined, scalable, and harmonized workflows and data pipelines that permit longitudinal chemical exposome tracking, retrospective validation, and multi-omics integration for meaningful health-oriented inferences. In this article, we survey the literature on state-of-the-art HRMS-based technologies, review current analytical workflows and informatic pipelines, and provide an up-to-date reference on exposomic approaches for chemists, toxicologists, epidemiologists, care providers, and stakeholders in health sciences and medicine. We propose efforts to benchmark fit-for-purpose platforms for expanding coverage of chemical space, including gas/liquid chromatography-HRMS (GC-HRMS and LC-HRMS), and discuss opportunities, challenges, and strategies to advance the burgeoning field of the exposome.

  • Article type: Journal Article
    The exploration of chemical space is a fundamental aspect of chemoinformatics, particularly when one explores a large compound data set to relate chemical structures to molecular properties. In this study, we extend our previous work on chemical space visualization at the pharmacophoric level. Instead of using the conventional binary classification of affinity (active vs inactive), we introduce a refined approach that categorizes compounds into four distinct classes based on their activity levels: super active, very active, active, and inactive. This classification enriches the color scheme applied to pharmacophore space, where the color representation of a pharmacophore hypothesis is driven by the associated compounds. Using the BCR-ABL tyrosine kinase as a case study, we identified intriguing regions corresponding to pharmacophore activity discontinuities, providing valuable insights for structure-activity relationship analysis.

  • Article type: Journal Article
    Computational exploration of chemical space is crucial in modern cheminformatics research for accelerating the discovery of new biologically active compounds. In this study, we present a detailed analysis of the chemical library of potential glucocorticoid receptor (GR) ligands generated by the molecular generator, Molpher. To generate the targeted GR library and construct the classification models, structures from the ChEMBL database as well as from the internal IMG library, which was experimentally screened for biological activity in the primary luciferase reporter cell assay, were utilized. The composition of the targeted GR ligand library was compared with a reference library that randomly samples chemical space. A random forest model was used to determine the biological activity of ligands, incorporating its applicability domain using conformal prediction. It was demonstrated that the GR library is significantly enriched with GR ligands compared to the random library. Furthermore, a prospective analysis demonstrated that Molpher successfully designed compounds, which were subsequently experimentally confirmed to be active on the GR. A collection of 34 potential new GR ligands was also identified. Moreover, an important contribution of this study is the establishment of a comprehensive workflow for evaluating computationally generated ligands, particularly those with potential activity against targets that are challenging to dock.