Keywords: Open reading frame; Prediction; Protein-coding potential; Wavelet; lncRNA

MeSH: Open Reading Frames / genetics; RNA, Long Noncoding / genetics; Algorithms; Proteins / genetics

Source: DOI:10.1016/j.compbiomed.2023.107752

Abstract:
Identifying long non-coding RNAs (lncRNAs) and determining their functions helps clarify transcriptional regulation in both normal development and disease pathology, which creates a need for methods that distinguish lncRNAs from protein-coding RNAs (pcRNAs) once sequencing data are available. Many algorithms based on the statistical, structural, physical, and chemical properties of sequences have been developed to evaluate the coding potential of RNA for this purpose. To obtain general-purpose features that do not rely on hyperparameter tuning and optimization and that can be evaluated accurately, we designed a series of features derived from the effects of open reading frames (ORFs), covering both their mutual interactions and their interactions with the electrical intensity of sequence sites, to further improve screening accuracy. A single model built from our designed features meets the criteria for a strong classifier, with accuracy between 82% and 89%, and the prediction accuracy of the model built after adding auxiliary features equals or exceeds that of some of the best classification tools. Moreover, compared with other methods, our method requires no special hyperparameter tuning and is insensitive to species, so it can be applied easily to a wide range of species. We also find some correlations between the features, which may serve as a reference for follow-up studies.
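For readers unfamiliar with ORF-based features, the following is a minimal sketch of how simple ORF statistics can be extracted from a transcript. It is not the feature set described in the abstract, which is not specified here; the function names `find_orfs` and `orf_features`, the forward-strand-only scan, and the AUG-to-stop definition of an ORF are all assumptions made for illustration.

```python
# Illustrative only: classic ORF statistics, not the authors' features.
STOP_CODONS = {"UAA", "UAG", "UGA"}


def find_orfs(seq):
    """Return (start, end, frame) for each AUG...stop ORF on the forward strand.

    Assumption: an ORF starts at the first AUG seen in a frame and ends at
    the next in-frame stop codon (end index is exclusive and includes the stop).
    """
    seq = seq.upper().replace("T", "U")  # accept DNA or RNA input
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "AUG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                orfs.append((start, i + 3, frame))
                start = None
    return orfs


def orf_features(seq):
    """Summarise ORFs as a small feature dictionary (hypothetical feature names)."""
    orfs = find_orfs(seq)
    if not orfs:
        return {"orf_count": 0, "longest_orf_len": 0, "orf_coverage": 0.0}
    longest = max(end - start for start, end, _ in orfs)
    return {
        "orf_count": len(orfs),
        "longest_orf_len": longest,
        "orf_coverage": longest / len(seq),  # fraction of transcript covered
    }


if __name__ == "__main__":
    print(orf_features("GGAUGAAAUUUUAGCC"))
```

Features of this kind are deterministic functions of the sequence, so they need no tuning before being passed to a classifier, which is the property the abstract emphasises for its own features.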