Keywords: AlphaFold; Alternative food protein; Deep learning; Protein solubility prediction; Seed proteins

Source: DOI:10.1016/j.ijbiomac.2024.134601

Abstract:
Accurate protein solubility prediction is crucial for screening suitable candidates for food applications. Existing models often rely on sequences alone, overlooking important structural details. In this study, a regression model for protein solubility was developed using both the sequences and the predicted structures of 2983 E. coli proteins. Sequence-level and structure-level properties of the proteins were extracted bioinformatically and fed to a multilayer perceptron (MLP). In addition, residue-level features and contact maps were used to construct a graph convolutional network (GCN). The out-of-fold predictions of the two models were combined and fed into multiple meta-regressors to create a stacking model. The stacking model with a support vector regressor (SVR) achieved R² values of 0.502 and 0.468 on the test and external validation datasets, respectively, outperforming existing regression models. Its improved performance over its base models indicates that the stacking model effectively captured the strengths of those models as well as the significance of the different features used. Furthermore, the model's transferability was indirectly validated on a dataset of seed storage proteins classified using the Osborne definition, as well as in a case study using molecular dynamics simulation, showing potential for application beyond microbial proteins to food- and agriculture-related ones.
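The stacking step described in the abstract (out-of-fold predictions from two base models, combined by an SVR meta-regressor) can be illustrated with a minimal scikit-learn sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the data are synthetic, the hyperparameters are arbitrary, and a `Ridge` regressor stands in for the GCN, which would need a graph library and real contact maps.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_predict, train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for protein features and solubility (assumption, not real data).
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
y = (y - y.min()) / (y.max() - y.min())  # rescale targets to [0, 1], like a solubility score
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Base models: an MLP, plus Ridge as a hypothetical placeholder for the GCN.
base_models = [
    make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
    ),
    Ridge(alpha=1.0),
]

# Out-of-fold predictions: each training sample is predicted by a model
# that never saw it during fitting, so the meta-regressor is not trained
# on leaked (overfitted) base-model outputs.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
oof = np.column_stack(
    [cross_val_predict(model, X_train, y_train, cv=cv) for model in base_models]
)

# Meta-regressor: an SVR trained on the stacked out-of-fold predictions.
meta = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.01))
meta.fit(oof, y_train)

# At test time, refit each base model on the full training set and stack
# its test predictions in the same column order as the out-of-fold matrix.
test_features = np.column_stack(
    [model.fit(X_train, y_train).predict(X_test) for model in base_models]
)
y_pred = meta.predict(test_features)
r2 = r2_score(y_test, y_pred)
print(f"Stacked model R^2 on held-out test split: {r2:.3f}")
```

The R² printed here describes only the synthetic data and bears no relation to the 0.502 and 0.468 reported above.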