Keywords: COVID-19; Essential proteins; Human; Machine learning; Protein-protein interaction network; Sequence features; Yeast

MeSH: Humans; Protein Interaction Maps; Saccharomyces cerevisiae; Bayes Theorem; Proteins / chemistry; Machine Learning

Source: DOI:10.7717/peerj.17010

Abstract:
Essential proteins are indispensable for an organism's viability, reproduction, and other fundamental physiological functions. Identifying them with conventional biological assays is slow, labor-intensive, and expensive, so computational methods are widely regarded as the fastest and most effective way to identify essential proteins. Although deep learning (DL) is a popular choice in machine learning (ML) applications, it is not recommended for this sequence-feature-based study because high-quality training sets of positive and negative samples are scarce. Some recent DL work does address limited data, and this is left as future work. Conventional ML techniques are therefore used here, as they performed better than DL methods in this setting. On this basis, a technique called EPI-SF is proposed, which uses ML to identify essential proteins within a protein-protein interaction network (PPIN). Because the protein sequence is the primary determinant of protein structure and function, relevant sequence features are first extracted from the proteins in the PPIN. These features are then used as input to several ML models: an XGBoost classifier, an AdaBoost classifier, logistic regression (LR), a support vector machine (SVM), a decision tree (DT), a random forest (RF), and naïve Bayes (NB). The objective is to detect the essential proteins within the PPIN. The primary investigation compared the performance of these models on the yeast PPIN. Among them, the RF model was the most effective, with precision, recall, F1-score, and AUC values of 0.703, 0.720, 0.711, and 0.745, respectively. It also outperformed other state-of-the-art methods, both those based on traditional centrality measures such as betweenness centrality (BC) and closeness centrality (CC) and deep learning methods such as DeepEP, as detailed in the results section. Given this favorable performance, EPI-SF was then used to predict novel essential proteins in the human PPIN. Because viruses tend to selectively target essential proteins involved in disease transmission within the human PPIN, the possible involvement of the predicted proteins in COVID-19 and other related severe diseases was also investigated.
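To make the workflow in the abstract more concrete, the following is a minimal Python sketch, not the authors' implementation. The abstract does not say which sequence features EPI-SF uses, so amino acid composition is assumed here purely for illustration. The function names and the toy sequence are hypothetical. The last step only checks that the reported F1-score agrees with the reported precision and recall.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid in a sequence.

    Assumption: this is one common sequence feature, not necessarily
    one of the features used by EPI-SF.
    """
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for binary labels (1 = essential)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy sequence, used only to show the shape of the feature vector.
features = amino_acid_composition("MKTAYIAKQR")

# Consistency check on the values reported for the RF model.
reported_f1 = f1_from(0.703, 0.720)
print(round(reported_f1, 3))  # → 0.711
```

In a full pipeline, one feature vector per protein would be passed to a classifier such as a random forest, and the predictions on held-out proteins would be scored with `precision_recall_f1`.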