computational prediction

  • Article type: Journal Article
    In contrast to messenger RNAs, the function of the wide range of existing long noncoding RNAs (lncRNAs) largely depends on their structure, which determines their interactions with partner molecules. Thus, determining or predicting the secondary structure of lncRNAs is critical to uncovering their function. Classical approaches for predicting RNA secondary structure have been based on dynamic programming and thermodynamic calculations. In the last 4 years, a growing number of machine learning (ML)-based models, including deep learning (DL), have achieved breakthrough performance in structure prediction of biomolecules such as proteins and have outperformed classical methods in folding short transcripts. Nevertheless, accurate prediction for lncRNAs remains far from being effectively solved. Notably, the myriad of new proposals has not been systematically and experimentally evaluated.
    In this work, we compare the performance of the classical methods as well as the most recently proposed approaches for secondary structure prediction of RNA sequences using a unified and consistent experimental setup. We use the publicly available structural profiles for 3023 yeast RNA sequences, and a novel benchmark of well-characterized lncRNA structures from different species. Moreover, we propose a novel metric to assess the predictive performance of methods, exclusively based on the chemical probing data commonly used for profiling RNA structures, avoiding any potential bias incorporated by computational predictions when using dot-bracket references. Our results provide a comprehensive comparative assessment of existing methodologies, and a novel and public benchmark resource to aid in the development and comparison of future approaches.
    Full source code and benchmark datasets are available at: https://github.com/sinc-lab/lncRNA-folding.
    Contact: lbugnon@sinc.unl.edu.ar

  • Article type: Journal Article
    A number of machine learning (ML)-based algorithms have been proposed for predicting mutation-induced stability changes in proteins. In this critical review, we used hypothetical reverse mutations to evaluate the performance of five representative algorithms and found that all of them suffer from overfitting. This approach is based on the fact that if a wild-type protein is more stable than a mutant protein, then the same mutant is less stable than the wild-type protein. We analyzed the underlying issues and suggest that the main causes of overfitting are that the training sets were too small and that the features used in the models were not sufficiently informative for the task. We make recommendations on how to avoid overfitting in this important research area and improve the reliability and robustness of ML-based algorithms in general.
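The reverse-mutation idea in the abstract above can be illustrated with a minimal sketch. It assumes the predictor outputs a signed stability change (commonly written ΔΔG) for each mutation, so that a consistent predictor should return equal and opposite values for a mutation and its hypothetical reverse. The function name `antisymmetry_report` and the two summary quantities are illustrative choices, not taken from the reviewed paper.

```python
from statistics import mean


def antisymmetry_report(forward, reverse):
    """Summarize how far paired predictions deviate from antisymmetry.

    forward[i]: predicted stability change for wild type -> mutant i
    reverse[i]: predicted stability change for mutant i -> wild type
    (Assumption: both use the same sign convention and units.)
    """
    if len(forward) != len(reverse) or not forward:
        raise ValueError("forward and reverse must be non-empty and equal in length")

    # For a self-consistent predictor, forward + reverse is 0 for every pair.
    residuals = [f + r for f, r in zip(forward, reverse)]

    return {
        # Average offset shared by both directions; 0 means no systematic bias.
        "bias": mean(residuals) / 2,
        # Largest single-pair violation of antisymmetry.
        "max_violation": max(abs(x) for x in residuals),
    }
```

A bias far from zero would suggest that the model favors one sign of stability change regardless of the direction of the mutation, which is the kind of behavior a reverse-mutation test is meant to expose.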

  • Article type: Comparative Study
    Drug-protein interactions (DPIs) underlie the desired therapeutic actions and the adverse side effects of a significant majority of drugs. Computational prediction of DPIs facilitates research in drug discovery, characterization and repurposing. Similarity-based methods that do not require knowledge of protein structures are particularly suitable for druggable genome-wide predictions of DPIs. We review 35 high-impact similarity-based predictors that were published in the past decade. We group them by the three types of similarity they use and the combinations of these types. We discuss and compare key aspects of these methods, including source databases, internal databases and their predictive models. Using our novel benchmark database, we perform a comparative empirical analysis of the predictive performance of seven types of representative predictors that utilize each type of similarity individually and all possible combinations of similarities. We assess predictive quality at the database-wide DPI level and we are the first to also include evaluation over individual drugs. Our comprehensive analysis shows that predictors that use more similarity types outperform methods that employ fewer similarities, and that the model combining all three types of similarities achieves an area under the receiver operating characteristic curve (AUC) of 0.93. We offer a comprehensive analysis of the sensitivity of predictive performance to intrinsic and extrinsic characteristics of the considered predictors. We find that predictive performance is sensitive to low levels of similarity between sequences of the drug targets and to several extrinsic properties of the input drug structures, drug profiles and drug targets. The benchmark database and a webserver for the seven predictors are freely available at http://biomine.cs.vcu.edu/servers/CONNECTOR/.
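The distinction between database-wide and per-drug evaluation mentioned in the abstract above can be sketched as follows. This is a generic illustration, not the authors' evaluation code: it assumes each prediction is a `(drug_id, score, label)` record with `label` equal to 1 for a known interaction and 0 otherwise, and it computes the AUC as the probability that a randomly chosen positive is scored above a randomly chosen negative. The names `roc_auc` and `per_drug_auc` are hypothetical.

```python
from collections import defaultdict


def roc_auc(scores, labels):
    """AUC via pairwise comparison of positives and negatives (ties count half).

    Returns None when one of the two classes is missing, since AUC is undefined.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def per_drug_auc(records):
    """Group (drug_id, score, label) records by drug and compute one AUC per drug."""
    by_drug = defaultdict(list)
    for drug, score, label in records:
        by_drug[drug].append((score, label))

    result = {}
    for drug, pairs in by_drug.items():
        scores, labels = zip(*pairs)
        result[drug] = roc_auc(scores, labels)
    return result
```

Pooling all records into a single `roc_auc` call gives the database-wide figure, while `per_drug_auc` shows whether that figure hides drugs for which the ranking is poor. The quadratic pairwise loop is chosen for clarity and would need a rank-based implementation for large benchmarks.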