evolutionary inference

  • Article type: Journal Article
    Accurate understanding of the biological functions of enzymes is vital for various tasks in both pathology and industrial biotechnology. However, the existing methods are usually not fast enough and lack explanations of their predictions, which severely limits their real-world applications. Following our previous work, DEEPre, we propose a new interpretable and fast version (ifDEEPre) by designing novel self-guided attention and incorporating biological knowledge learned via large protein language models to accurately predict the Enzyme Commission (EC) numbers of enzymes and confirm their functions. Novel self-guided attention is designed to optimize the unique contributions of representations, automatically detecting key protein motifs to provide meaningful interpretations. Representations learned from raw protein sequences are strictly screened to improve the running speed of the framework, making it 50 times faster than DEEPre while requiring 12.89 times less storage space. Large language modules are incorporated to learn physical properties from hundreds of millions of proteins, extending the biological knowledge of the whole network. Extensive experiments indicate that ifDEEPre outperforms all the current methods, achieving an F1-score more than 14.22% higher on the NEW dataset. Furthermore, the trained ifDEEPre models accurately capture multi-level protein biological patterns and infer evolutionary trends of enzymes by taking only raw sequences without label information. Meanwhile, ifDEEPre predicts the evolutionary relationships between different yeast sub-species, which are highly consistent with the ground truth. Case studies indicate that ifDEEPre can detect key amino acid motifs, which have important implications for designing novel enzymes. A web server running ifDEEPre is available at https://proj.cse.cuhk.edu.hk/aihlab/ifdeepre/ to provide convenient services to the public. Meanwhile, ifDEEPre is freely available on GitHub at https://github.com/ml4bio/ifDEEPre/.
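The abstract above reports its headline result as an F1-score. As a reminder of what that metric measures, here is a minimal sketch that computes per-class and macro-averaged F1 for predicted first-digit EC classes. The labels are made up for illustration, and macro averaging is an assumption: the abstract does not say which averaging scheme the authors used.

```python
def f1_per_class(y_true, y_pred):
    """Return {label: F1} using F1 = 2*TP / (2*TP + FP + FN)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Hypothetical first-digit EC classes for six enzymes (not data from the paper).
y_true = ["1", "1", "2", "3", "3", "3"]
y_pred = ["1", "2", "2", "3", "3", "1"]

per_class = f1_per_class(y_true, y_pred)
overall = macro_f1(y_true, y_pred)
print(per_class, round(overall, 4))
```

Macro averaging weights every class equally, which matters when some EC classes are much rarer than others.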

  • Article type: Journal Article
    Angiotensin-converting enzyme 2 (ACE2) is the cell receptor that the coronavirus SARS-CoV-2 binds to and uses to enter and infect human cells. COVID-19, the pandemic disease caused by the coronavirus, involves diverse pathologies beyond those of a respiratory disease, including micro-thrombosis (micro-clotting), cytokine storms, and inflammatory responses affecting many organ systems. Longer-term chronic illness can persist for many months, often well after the pathogen is no longer detected. A better understanding of the proteins that ACE2 interacts with can reveal information relevant to these disease manifestations and possible avenues for treatment. We have undertaken an approach to predict candidate ACE2 interacting proteins which uses evolutionary inference to identify a set of mammalian proteins that "coevolve" with ACE2. The approach, called evolutionary rate correlation (ERC), detects proteins that show highly correlated evolutionary rates during mammalian evolution. Such proteins are candidates for biological interactions with the ACE2 receptor. The approach has uncovered a number of key ACE2 protein interactions of potential relevance to COVID-19 pathologies. Some proteins have previously been reported to be associated with severe COVID-19, but are not currently known to interact with ACE2, while additional predicted novel ACE2 interactors are of potential relevance to the disease. Using reciprocal rankings of protein ERCs, we have identified strongly interconnected ACE2 associated protein networks relevant to COVID-19 pathologies. ACE2 has clear connections to coagulation pathway proteins, such as Coagulation Factor V and fibrinogen components FGA, FGB, and FGG, the latter possibly mediated through ACE2 connections to Clusterin (which clears misfolded extracellular proteins) and GPR141 (whose functions are relatively unknown). ACE2 also connects to proteins involved in cytokine signaling and immune response (e.g. XCR1, IFNAR2 and TLR8), and to Androgen Receptor (AR). The ERC prescreening approach has elucidated possible functions for relatively uncharacterized proteins and possible new functions for well-characterized ones. Suggestions are made for the validation of ERC-predicted ACE2 protein interactions. We propose that ACE2 has novel protein interactions that are disrupted during SARS-CoV-2 infection, contributing to the spectrum of COVID-19 pathologies.
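The idea behind evolutionary rate correlation and reciprocal ranking can be sketched in a few lines. The example below is only an illustration under stated assumptions: the per-branch rates are invented, the protein names `GENE_A`, `GENE_B` and `GENE_C` are hypothetical, and Spearman rank correlation is an assumed choice, since the abstract does not say which correlation statistic or rate-estimation procedure the authors used.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def erc_table(rates):
    """rates: {protein: [rate on branch 1, rate on branch 2, ...]}."""
    table = {p: {} for p in rates}
    for a, b in combinations(rates, 2):
        rho = spearman(rates[a], rates[b])
        table[a][b] = rho
        table[b][a] = rho
    return table


def partner_ranks(table, protein):
    """Map each partner to its rank for `protein` (1 = strongest ERC)."""
    ordered = sorted(table[protein], key=table[protein].get, reverse=True)
    return {partner: i + 1 for i, partner in enumerate(ordered)}


def reciprocal_rank(table, a, b):
    """Rank of b among a's partners, and rank of a among b's partners."""
    return partner_ranks(table, a)[b], partner_ranks(table, b)[a]


# Invented rates on the same six tree branches for each protein.
rates = {
    "ACE2":   [0.8, 1.4, 0.6, 2.1, 1.0, 1.7],
    "GENE_A": [0.9, 1.5, 0.5, 2.3, 1.1, 1.6],  # speeds up and slows down with ACE2
    "GENE_B": [1.2, 1.1, 1.3, 1.0, 1.4, 0.9],
    "GENE_C": [0.7, 1.2, 0.9, 1.8, 0.8, 1.9],
}
table = erc_table(rates)
print(partner_ranks(table, "ACE2"))
print(reciprocal_rank(table, "ACE2", "GENE_A"))
```

A pair that ranks highly in both directions is a stronger candidate than one that ranks highly in only one protein's list, which is the motivation for looking at reciprocal rather than one-sided rankings.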

  • Article type: Journal Article
    The Ewens sampling formula is a foundational theoretical result that connects probability and number theory with molecular genetics and molecular evolution; it was the analytical result required for testing the neutral theory of evolution, and has since been directly or indirectly utilized in a number of population genetics statistics. The Ewens sampling formula, in turn, is deeply connected to Stirling numbers of the first kind. Here, we explore the cumulative distribution function of these Stirling numbers, which enables a single direct estimate of the sum, using representations in terms of the incomplete beta function. This estimator enables an improved method for calculating an asymptotic estimate for one useful statistic, Fu's [Formula: see text]. By reducing the calculation from a sum of terms involving Stirling numbers to a single estimate, we simultaneously improve accuracy and dramatically increase speed.
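To make the link between the Ewens sampling formula and Stirling numbers concrete, the sketch below computes the distribution of the number of distinct alleles K in a sample of size n, P(K = k) = |s(n, k)| θ^k / (θ(θ+1)...(θ+n-1)), and an upper-tail sum over k. This is the exact term-by-term summation, not the incomplete-beta estimator the abstract describes, and the function names are illustrative choices rather than anything from the paper.

```python
from fractions import Fraction


def unsigned_stirling_first(n):
    """Return [|s(n, 0)|, ..., |s(n, n)|], unsigned Stirling numbers of the first kind."""
    row = [1]  # |s(0, 0)| = 1
    for m in range(1, n + 1):
        # Recurrence: |s(m, k)| = (m - 1) * |s(m - 1, k)| + |s(m - 1, k - 1)|
        new = [0] * (m + 1)
        for k in range(m + 1):
            keep = (m - 1) * row[k] if k < m else 0
            join = row[k - 1] if k >= 1 else 0
            new[k] = keep + join
        row = new
    return row


def allele_count_distribution(n, theta):
    """Exact P(K = k) for k = 0..n under the Ewens sampling formula."""
    stirling = unsigned_stirling_first(n)
    theta = Fraction(theta)
    rising = Fraction(1)  # rising factorial theta * (theta + 1) * ... * (theta + n - 1)
    for i in range(n):
        rising *= theta + i
    return [stirling[k] * theta ** k / rising for k in range(n + 1)]


def tail_probability(n, theta, k_obs):
    """P(K >= k_obs): a sum of terms that each involve a Stirling number."""
    return sum(allele_count_distribution(n, theta)[k_obs:])


dist = allele_count_distribution(4, 1)
print(dist)
print(tail_probability(4, 1, 3))
```

Exact rational arithmetic keeps this small example free of rounding error, but the Stirling numbers grow very quickly with n, which is the practical cost that a single closed-form estimate avoids.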

  • Article type: Journal Article
    It is now recognised that the biology of almost any organism cannot be fully understood without recognising the existence and potential functional importance of associated microbes. Arguably, the emergence of this holistic viewpoint may never have occurred without the development of a crucial molecular technique, 16S rDNA amplicon sequencing, which allowed microbial communities to be easily profiled across a broad range of contexts. A diverse array of molecular techniques are now used to profile microbial communities, infer their evolutionary histories, visualise them in host tissues, and measure their molecular activity. In this review, we examine each of these categories of measurement and inference with a focus on the questions they make tractable, and the degree to which their capabilities and limitations shape our view of the holobiont.

  • Article type: Journal Article
    Inference of how evolutionary forces have shaped extant genetic diversity is a cornerstone of modern comparative sequence analysis. Advances in sequence generation and increased statistical sophistication of relevant methods now allow researchers to extract ever more evolutionary signal from the data, albeit at an increased computational cost. Here, we announce the release of Datamonkey 2.0, a completely re-engineered version of the Datamonkey web-server for analyzing evolutionary signatures in sequence data. For this endeavor, we leveraged recent developments in open-source libraries that facilitate interactive, robust, and scalable web application development. Datamonkey 2.0 provides a carefully curated collection of methods for interrogating coding-sequence alignments for imprints of natural selection, packaged as a responsive (i.e. can be viewed on tablet and mobile devices), fully interactive, and API-enabled web application. To complement Datamonkey 2.0, we additionally release HyPhy Vision, an accompanying JavaScript application for visualizing analysis results. HyPhy Vision can also be used separately from Datamonkey 2.0 to visualize locally executed HyPhy analyses. Together, Datamonkey 2.0 and HyPhy Vision showcase how scientific software development can benefit from general-purpose open-source frameworks. Datamonkey 2.0 is freely and publicly available at http://www.datamonkey.org, and the underlying codebase is available from https://github.com/veg/datamonkey-js.

  • Article type: Journal Article
    Knowledge of the frequency and relative weight of mutation and recombination events in evolution is essential for understanding how microorganisms reach fitted phenotypes. Traditionally, these evolutionary parameters have been inferred by using data from multilocus sequence typing (MLST), which is known to have yielded conflicting results. In the near future, these estimations will certainly be performed by computational analyses of full-genome sequences. However, it is not known whether this approach will yield accurate results, as bacterial genomes exhibit heterogeneous representation of loci categories, and it is not clear how the nature of the loci impacts such estimations. Therefore, we assessed how mutation and recombination inferences are shaped by loci with different genetic features, using the bacterium Chlamydia trachomatis as the study model. We found that loci assigning a high number of alleles and positively selected genes yielded nonconvergent estimates and incongruent phylogenies and thus are more prone to confound algorithms. Unexpectedly, for the model under evaluation, housekeeping genes and noncoding regions shaped estimations in a similar manner, which points to a nonrandom role of the latter in C. trachomatis evolution. Although the present results relate to a specific bacterium, we speculate that microbe-specific genomic architectures (such as coding capacity, polymorphism dispersion, and fraction of positively selected loci) may differentially buffer the effect of the confounding factors when estimating recombination and mutation rates and, thus, influence the accuracy of using full-genome sequences for this purpose. This putative bias associated with in silico inferences should be taken into account when discussing the results obtained by the analyses of full-genome sequences, in which the "one size fits all" approach may not be applicable.