Protein-coding potential

  • Article type: Journal Article
    The identification and functional characterization of long non-coding RNAs (lncRNAs) can help to better understand transcriptional regulation in both normal development and disease pathology, which calls for methods that distinguish them from protein-coding RNAs (pcRNAs) once sequencing data are available. Many algorithms based on the statistical, structural, physical, and chemical properties of sequences have been developed to evaluate the coding potential of RNAs for this purpose. To design general-purpose features that do not rely on hyperparameter tuning and optimization and that can be evaluated accurately, we designed a series of features from the effects of open reading frames (ORFs) on their mutual interactions and with the electrical intensity of sequence sites, to further improve screening accuracy. A single model built from our designed features meets the strong classifier criteria, with accuracy between 82% and 89%, and the prediction accuracy of the model built after adding the auxiliary features equals or exceeds that of some of the best classification tools. Moreover, our method requires no special hyperparameter tuning and is less species-sensitive than other methods, which means it can be easily applied to a wide range of species. We also find some correlations between the features, which may provide a reference for follow-up studies.

  • Article type: Review
    In recent years, proteogenomics and ribosome profiling studies have identified a large number of proteins encoded by noncoding regions in the human genome. They are encoded by small open reading frames (sORFs) in the untranslated regions (UTRs) of mRNAs and in long non-coding RNAs (lncRNAs). These sORF-encoded proteins (SEPs) are often shorter than 150 amino acids and show poor evolutionary conservation. A subset of them has been functionally characterized and shown to play an important role in fundamental biological processes, including cardiac and muscle function, DNA repair, and embryonic development, and in various human diseases. How many novel protein-coding regions exist in the human genome, and what fraction of them is functionally important, remains a mystery. In this review, we discuss current progress in unraveling SEPs, the approaches used to identify them, the limitations of those approaches, and the reliability of the resulting identifications. We also discuss functionally characterized SEPs and their involvement in various biological processes and diseases. Lastly, we provide insights into their distinctive features compared to canonical proteins and the challenges associated with annotating them in protein reference databases.

  • Article type: Preprint
    Ribosomes are information-processing macromolecular machines that integrate complex sequence patterns in messenger RNA (mRNA) transcripts to synthesize proteins. Studies of the sequence features that distinguish mRNAs from long noncoding RNAs (lncRNAs) may yield insight into the information that directs and regulates translation. Computational methods for calculating protein-coding potential are important for distinguishing mRNAs from lncRNAs during genome annotation, but most machine learning methods for this task rely on previously known rules to define features. Sequence-to-sequence (seq2seq) models, particularly ones using transformer networks, have proven capable of learning complex grammatical relationships between words to perform natural language translation. Seeking to leverage these advancements in the biological domain, we present a seq2seq formulation for predicting protein-coding potential with deep neural networks and demonstrate that simultaneously learning translation from RNA to protein improves classification performance relative to a classification-only training objective. Inspired by classical signal processing methods for gene discovery and Fourier-based image-processing neural networks, we introduce LocalFilterNet (LFNet). LFNet is a network architecture with an inductive bias for modeling the three-nucleotide periodicity apparent in coding sequences. We incorporate LFNet within an encoder-decoder framework to test whether the translation task improves the classification of transcripts and the interpretation of their sequence features. We use the resulting model to compute nucleotide-resolution importance scores, revealing sequence patterns that could assist the cellular machinery in distinguishing mRNAs and lncRNAs. Finally, we develop a novel approach for estimating mutation effects from Integrated Gradients, a backpropagation-based feature attribution, and characterize the difficulty of efficient approximations in this setting.
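
    The three-nucleotide periodicity mentioned above can be illustrated with the classical Fourier approach that the abstract cites as inspiration. The sketch below is not LFNet and is not taken from the preprint; it is a minimal example under stated assumptions. The function name `period3_score`, the use of per-base indicator sequences, and the normalization by mean spectral power are all choices made here for illustration.

```python
import numpy as np


def period3_score(seq):
    """Ratio of spectral power at period 3 to the mean power over all
    non-zero frequencies (hypothetical helper, illustrative only)."""
    seq = seq.upper().replace("U", "T")
    # Trim so the length is a multiple of 3; frequency N/3 is then an exact DFT bin.
    n = len(seq) - len(seq) % 3
    if n < 3:
        return 0.0
    seq = seq[:n]

    total_power = np.zeros(n // 2 + 1)
    for base in "ACGT":
        # Binary indicator sequence: 1 where this base occurs, 0 elsewhere.
        indicator = np.array([1.0 if ch == base else 0.0 for ch in seq])
        # Remove the mean so the zero-frequency term does not dominate.
        spectrum = np.fft.rfft(indicator - indicator.mean())
        total_power += np.abs(spectrum) ** 2

    mean_power = total_power[1:].mean()
    if mean_power == 0:
        return 0.0  # e.g. a homopolymer has no variation to analyze
    return float(total_power[n // 3] / mean_power)
```

    A perfectly codon-periodic toy sequence such as `"ATG" * 30` concentrates its power at period 3 and scores far above 1, while a homopolymer returns 0.0. Real transcripts fall in between, and any decision threshold would have to be chosen empirically.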

  • Article type: Journal Article
    Recent studies have identified numerous RNAs with both coding and noncoding functions. However, the sequence characteristics that determine this bifunctionality remain largely unknown. In the present study, we develop and test the open reading frame (ORF) dominance score, which we define as the fraction of the longest ORF in the sum of all putative ORF lengths. This score correlates with translation efficiency in coding transcripts and with translation of noncoding RNAs. In bacteria and archaea, coding and noncoding transcripts have narrow distributions of high and low ORF dominance, respectively, whereas those of eukaryotes show relatively broader ORF dominance distributions, with considerable overlap between coding and noncoding transcripts. The extent of overlap positively and negatively correlates with the mutation rate of genomes and the effective population size of species, respectively. Tissue-specific transcripts show higher ORF dominance than ubiquitously expressed transcripts, and the majority of tissue-specific transcripts are expressed in mature testes. These data suggest that the decrease in population size and the emergence of testes in eukaryotic organisms allowed for the evolution of potentially bifunctional RNAs.
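
    The ORF dominance score defined above (length of the longest ORF divided by the summed length of all putative ORFs) can be sketched as follows. The abstract does not specify how putative ORFs are enumerated, so this example makes its own assumptions: an ORF runs from an ATG to the first in-frame stop codon, only the three forward frames are scanned, the stop codon is counted in the length, and ORFs lacking a stop are ignored. The function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orf_lengths(seq):
    """Lengths (in nt) of ATG-to-stop ORFs in the three forward frames.

    Assumption for this sketch: an in-frame ATG inside an already open
    ORF does not start a separate ORF.
    """
    seq = seq.upper().replace("U", "T")
    lengths = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a new ORF at the first ATG
            elif start is not None and codon in STOP_CODONS:
                lengths.append(i + 3 - start)  # include the stop codon
                start = None
    return lengths


def orf_dominance(seq):
    """Longest ORF length divided by the sum of all ORF lengths."""
    lengths = find_orf_lengths(seq)
    if not lengths:
        return 0.0  # no complete ORF found under the assumptions above
    return max(lengths) / sum(lengths)


# Toy transcript with a 9-nt ORF in frame 0 and a 6-nt ORF in frame 1.
print(orf_dominance("ATGAAATAACATGTAA"))  # 9 / (9 + 6) = 0.6
```

    Under this definition a transcript with a single ORF scores 1.0, and the score falls as additional competing ORFs appear, which matches the intuition in the abstract that coding transcripts tend toward high ORF dominance.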

  • Article type: Journal Article
    Circular RNAs (circRNAs) are a newly characterized type of noncoding RNA and play important roles in microRNA (miRNA) function and transcriptional control. To unravel the mechanism of soybean circRNAs in the low-temperature (LT) stress response, genome-wide identification of soybean circRNAs was conducted under LT (4 °C) treatment via deep sequencing. In this study, the existence of backsplicing sites was validated, and circRNAs exhibited specific expression patterns in response to LT. Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) analyses showed that circRNAs could participate in LT-responsive processes. Our study revealed a new circRNA-miRNA-mRNA network that is involved in LT responses. Furthermore, soybean circRNAs were predicted to have the potential to encode polypeptides or proteins. Taken together, our results indicate that soybean circRNAs might encode proteins and be involved in the regulation of LT responses, providing clues regarding the molecular LT-responsive mechanisms in soybean.

  • Article type: Journal Article
    Long noncoding RNAs (lncRNAs), generally longer than 200 nucleotides and with poor protein-coding potential, are usually considered collectively as a heterogeneous class of RNAs. Recently, an increasing number of studies have shown that lncRNAs are involved in various critical biological processes and a number of complex human diseases. Not only are the primary sequences of many lncRNAs directly related to specific functional roles, but strong evidence also suggests that their secondary structures are even more closely related to their known functions. As functional molecules, lncRNAs have attracted growing interest from researchers. Here, we review recent, state-of-the-art advances at three levels of lncRNA research (primary sequence, secondary structure, and function annotation), as well as computational methods for lncRNA data analysis.