Keywords: Bioinformatics; Convolutional neural network; Deep learning; Long short-term memory; Machine learning; Support vector machine; Transcription start site

Source: DOI:10.7717/peerj-cs.1340

Abstract:
Recognizing transcription start sites is key to gene identification. Several approaches have been employed in related problems such as detecting translation initiation sites or promoters, with many of the most recent ones based on machine learning. Deep learning methods have proven exceptionally effective for such tasks, but their use in transcription start site identification has not yet been explored in depth. Moreover, the very few existing works neither compare their methods to support vector machines (SVMs), the most established technique in this area of study, nor provide the curated dataset used in the study. The small number of published papers on this specific problem could be explained by this lack of datasets. Given that both support vector machines and deep neural networks have been applied to related problems with remarkable results, we compared their performance in transcription start site prediction, using the reference human genome GRCh38. We conclude that SVMs are computationally much slower and that deep learning methods, especially long short-term memory neural networks (LSTMs), are better suited than SVMs to working with sequences. Additionally, we studied two aspects of data processing: the proper way to generate training examples and the imbalanced nature of the data. The generalization performance of the models was also tested on the mouse genome, where the LSTM neural network stood out from the rest of the algorithms. To sum up, this article provides an analysis of the best architecture choices for transcription start site identification, as well as a method to generate transcription start site datasets, including negative instances, for any species available in Ensembl. We found that deep learning methods are better suited than SVMs to this problem, being more efficient and better adapted to long sequences and large amounts of data. We also created a transcription start site (TSS) dataset large enough to be used in deep learning experiments.
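The two data-processing aspects mentioned in the abstract (generating training examples and handling class imbalance) can be illustrated with a minimal sketch. This is not the article's pipeline: the function names, the fixed window centred on each TSS, the rule that a negative is any window whose centre is not an annotated TSS, and the `negatives_per_positive` ratio are all assumptions made for illustration.

```python
import random

# One-hot column order; any other symbol (e.g. N) encodes as all zeros.
ALPHABET = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a list of 4-element 0/1 vectors."""
    return [[1 if base == letter else 0 for letter in ALPHABET]
            for base in seq.upper()]


def window(genome, center, flank):
    """Return the 2*flank+1 bases centred on `center`, or None if out of range."""
    start, end = center - flank, center + flank + 1
    if start < 0 or end > len(genome):
        return None
    return genome[start:end]


def build_examples(genome, tss_positions, flank, negatives_per_positive, rng):
    """Return (sequence, label) pairs: 1 for TSS-centred windows, 0 otherwise.

    Assumption: a negative is a window whose centre is not an annotated TSS.
    `negatives_per_positive` sets the class ratio (the imbalance knob).
    """
    tss = set(tss_positions)
    positives = []
    for pos in sorted(tss):
        seq = window(genome, pos, flank)
        if seq is not None:
            positives.append((seq, 1))
    # Enumerating every candidate centre is fine for a toy sequence only;
    # a real chromosome would need sampling without building this list.
    candidates = [i for i in range(flank, len(genome) - flank) if i not in tss]
    k = min(len(candidates), negatives_per_positive * len(positives))
    negatives = [(window(genome, c, flank), 0) for c in rng.sample(candidates, k)]
    return positives + negatives


# Toy usage: a made-up 20-base sequence with two hypothetical TSS positions.
rng = random.Random(0)
genome = "ACGTACGTACGTACGTACGT"
examples = build_examples(genome, [5, 12], flank=2,
                          negatives_per_positive=3, rng=rng)
encoded = [(one_hot(seq), label) for seq, label in examples]
```

Raising `negatives_per_positive` makes the toy dataset more imbalanced, which is one simple way to probe how a classifier reacts to the class ratio.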