Protein-coding potential

  • Article type: Journal Article
    Identifying long non-coding RNAs (lncRNAs) and determining their functions can improve our understanding of transcriptional regulation in both normal development and disease pathology. This requires methods that distinguish lncRNAs from protein-coding RNAs (pcRNAs) once sequencing data have been obtained. Many algorithms based on the statistical, structural, physical, and chemical properties of sequences have been developed to evaluate the coding potential of RNA for this purpose. To obtain general features that can be evaluated accurately without hyperparameter tuning or optimization, we designed a series of features derived from the effects of open reading frames (ORFs) on one another and from their interactions with the electrical intensity of sequence sites, with the aim of further improving screening accuracy. A single model built from our designed features meets the criteria for a strong classifier, with accuracy between 82% and 89%, and the prediction accuracy of the model built after adding auxiliary features equals or exceeds that of some of the best classification tools. Moreover, unlike other methods, our method requires no special hyperparameter tuning and is insensitive to species, so it can be readily applied to a wide range of species. We also find some correlations between the features, which may serve as a reference for follow-up studies.






  • Article type: Review
    In recent years, proteogenomics and ribosome profiling studies have identified a large number of proteins encoded by noncoding regions in the human genome. They are encoded by small open reading frames (sORFs) in the untranslated regions (UTRs) of mRNAs and in long non-coding RNAs (lncRNAs). These sORF-encoded proteins (SEPs) are often shorter than 150 amino acids and show poor evolutionary conservation. A subset of them has been functionally characterized and shown to play an important role in fundamental biological processes, including cardiac and muscle function, DNA repair, and embryonic development, as well as in various human diseases. How many novel protein-coding regions exist in the human genome, and what fraction of them is functionally important, remains unknown. In this review, we discuss current progress in unraveling SEPs, the approaches used to identify them, the limitations of those approaches, and the reliability of the resulting identifications. We also discuss functionally characterized SEPs and their involvement in various biological processes and diseases. Lastly, we provide insights into their distinctive features compared to canonical proteins and the challenges associated with annotating them in protein reference databases.






  • Article type: Preprint
    Ribosomes are information-processing macromolecular machines that integrate complex sequence patterns in messenger RNA (mRNA) transcripts to synthesize proteins. Studies of the sequence features that distinguish mRNAs from long noncoding RNAs (lncRNAs) may yield insight into the information that directs and regulates translation. Computational methods for calculating protein-coding potential are important for distinguishing mRNAs from lncRNAs during genome annotation, but most machine learning methods for this task rely on previously known rules to define features. Sequence-to-sequence (seq2seq) models, particularly ones using transformer networks, have proven capable of learning complex grammatical relationships between words to perform natural language translation. Seeking to leverage these advancements in the biological domain, we present a seq2seq formulation for predicting protein-coding potential with deep neural networks and demonstrate that simultaneously learning translation from RNA to protein improves classification performance relative to a classification-only training objective. Inspired by classical signal processing methods for gene discovery and Fourier-based image-processing neural networks, we introduce LocalFilterNet (LFNet). LFNet is a network architecture with an inductive bias for modeling the three-nucleotide periodicity apparent in coding sequences. We incorporate LFNet within an encoder-decoder framework to test whether the translation task improves the classification of transcripts and the interpretation of their sequence features. We use the resulting model to compute nucleotide-resolution importance scores, revealing sequence patterns that could assist the cellular machinery in distinguishing mRNAs and lncRNAs. Finally, we develop a novel approach for estimating mutation effects from Integrated Gradients, a backpropagation-based feature attribution, and characterize the difficulty of efficient approximations in this setting.
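
    The three-nucleotide periodicity mentioned above can be illustrated with the classical Fourier approach to gene discovery. The sketch below is not LFNet and is not taken from the preprint; it is a minimal example that assumes binary indicator sequences for each base and sums their spectral power at frequency 1/3. The function name `period3_power` and the normalization by sequence length are illustrative choices.

```python
import cmath


def period3_power(seq):
    """Spectral power at frequency 1/3, summed over the four base
    indicator sequences and divided by sequence length.

    Hypothetical helper for illustration; the normalization is an assumption.
    """
    seq = seq.upper().replace("U", "T")
    n = len(seq)
    if n == 0:
        return 0.0
    total = 0.0
    for base in "ACGT":
        # DFT coefficient of the indicator sequence of `base` at frequency 1/3
        coeff = sum(
            cmath.exp(-2j * cmath.pi * k / 3)
            for k, b in enumerate(seq)
            if b == base
        )
        total += abs(coeff) ** 2
    return total / n


# A sequence that repeats every three nucleotides scores high,
# while a homopolymer of the same length scores near zero.
periodic_score = period3_power("ATG" * 10)
flat_score = period3_power("A" * 30)
```

    A sequence with a strong codon-like repeat concentrates power at this frequency, which is the kind of signal an architecture with a periodicity bias is intended to capture.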






  • Article type: Journal Article
    Recent studies have identified numerous RNAs with both coding and noncoding functions. However, the sequence characteristics that determine this bifunctionality remain largely unknown. In the present study, we develop and test the open reading frame (ORF) dominance score, which we define as the fraction of the longest ORF in the sum of all putative ORF lengths. This score correlates with translation efficiency in coding transcripts and with translation of noncoding RNAs. In bacteria and archaea, coding and noncoding transcripts have narrow distributions of high and low ORF dominance, respectively, whereas those of eukaryotes show relatively broader ORF dominance distributions, with considerable overlap between coding and noncoding transcripts. The extent of overlap positively and negatively correlates with the mutation rate of genomes and the effective population size of species, respectively. Tissue-specific transcripts show higher ORF dominance than ubiquitously expressed transcripts, and the majority of tissue-specific transcripts are expressed in mature testes. These data suggest that the decrease in population size and the emergence of testes in eukaryotic organisms allowed for the evolution of potentially bifunctional RNAs.
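
    The ORF dominance score defined above (the length of the longest ORF divided by the sum of all putative ORF lengths) can be sketched as follows. The abstract does not specify how putative ORFs are enumerated, so this example makes assumptions: ORFs run from the first ATG after a stop to the next in-frame stop codon, only the three forward frames are scanned, and lengths include the stop codon. The function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orf_lengths(seq):
    """Lengths in nucleotides (stop codon included) of ATG-to-stop ORFs
    in the three forward reading frames.

    Assumption: this ORF definition is illustrative, not the authors' own.
    """
    seq = seq.upper().replace("U", "T")
    lengths = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a new ORF at the first in-frame ATG
            elif start is not None and codon in STOP_CODONS:
                lengths.append(i + 3 - start)  # close the ORF at the stop
                start = None
    return lengths


def orf_dominance(seq):
    """Longest ORF length divided by the sum of all ORF lengths."""
    lengths = find_orf_lengths(seq)
    if not lengths:
        return 0.0  # assumption: no ORF found is scored as zero
    return max(lengths) / sum(lengths)


# One 9-nt ORF in frame 0 and one 6-nt ORF in frame 1: 9 / (9 + 6)
score = orf_dominance("ATGAAATAACATGTAG")
```

    Under this definition a transcript with a single ORF scores 1, and the score falls as additional ORFs of comparable length appear.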






  • Article type: Journal Article
    Circular RNAs (circRNAs) are a newly characterized type of noncoding RNA and play important roles in microRNA (miRNA) function and transcriptional control. To unravel the role of soybean circRNAs in the low-temperature (LT) stress response, we carried out genome-wide identification of soybean circRNAs under LT (4 °C) treatment via deep sequencing. The existence of backsplicing sites was validated, and circRNAs exhibited specific expression patterns in response to LT. Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) analyses showed that circRNAs could participate in LT-responsive processes. Our study revealed a new circRNA-miRNA-mRNA network that is involved in LT responses. Furthermore, soybean circRNAs were predicted to have the potential to encode polypeptides or proteins. Taken together, our results indicate that soybean circRNAs might encode proteins and be involved in the regulation of LT responses, providing clues to the molecular mechanisms of the LT response in soybean.






  • Article type: Journal Article
    Long noncoding RNAs (lncRNAs), generally longer than 200 nucleotides and with poor protein-coding potential, are usually considered collectively as a heterogeneous class of RNAs. Recently, an increasing number of studies have shown that lncRNAs are involved in various critical biological processes and in a number of complex human diseases. Not only are the primary sequences of many lncRNAs directly related to specific functional roles, but strong evidence also suggests that their secondary structures are even more closely related to their known functions. As functional molecules, lncRNAs have attracted growing interest from researchers. Here, we review recent, state-of-the-art advances at three levels of lncRNA research (the primary sequence, the secondary structure, and the function annotation), as well as computational methods for lncRNA data analysis.





