Keywords: Allergen; Deep neural network; Machine learning; Sequence analysis; Stacking framework

MeSH: Allergens / immunology, chemistry; Deep Learning; Machine Learning; Software; Computational Biology / methods; Humans; Neural Networks, Computer

Source: DOI:10.1016/j.ijbiomac.2024.133085

Abstract:
Allergy is a hypersensitive condition in which individuals develop objective symptoms when exposed to harmless substances at a dose that would cause no harm to a "normal" person. Most current computational methods for allergen identification rely on homology or on conventional machine learning with a limited set of feature descriptors or validation on specific datasets, making them inefficient and inaccurate. Here, we propose SEP-AlgPro for the accurate identification of allergen proteins from sequence information. We analyzed 10 conventional protein-based features and 14 different features derived from protein language models to gauge their effectiveness in differentiating allergens from non-allergens using 15 different classifiers. The final optimized model, however, employs the top 10 feature descriptors with the top seven machine learning classifiers. Results show that the features derived from protein language models exhibit superior discriminative capabilities compared to traditional feature sets. This enabled us to select the most discriminatory baseline models, whose predicted outputs were aggregated and used as input to a deep neural network for the final allergen prediction. Extensive case studies showed that SEP-AlgPro outperforms state-of-the-art predictors in accurately identifying allergens. A user-friendly web server was developed and made freely available at https://balalab-skku.org/SEP-AlgPro/, making it a powerful tool for identifying potential allergens.
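
The stacking idea described in the abstract (baseline models whose predicted outputs are aggregated and passed to a neural network) can be illustrated with a minimal sketch. This is not the SEP-AlgPro implementation: the data are synthetic, the two "feature descriptors" are hypothetical column slices standing in for real protein encodings, the two baseline classifiers are arbitrary choices, and scikit-learn's MLPClassifier is used as a small stand-in for the deep neural network. All names and hyperparameters below are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for encoded protein sequences (NOT real allergen data).
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical feature descriptors, modelled here as column slices.
descriptors = {"descriptor_a": slice(0, 10), "descriptor_b": slice(10, 20)}

# Hypothetical baseline classifiers; factories give a fresh model per use.
classifiers = {
    "rf": lambda: RandomForestClassifier(n_estimators=50, random_state=0),
    "lr": lambda: LogisticRegression(max_iter=1000),
}


def build_meta_features(X_tr, y_tr, X_te):
    """Return one probability column per (descriptor, classifier) pair."""
    train_cols, test_cols = [], []
    for cols in descriptors.values():
        for make_clf in classifiers.values():
            # Out-of-fold probabilities keep training labels from leaking
            # into the meta-learner's inputs.
            oof = cross_val_predict(
                make_clf(), X_tr[:, cols], y_tr, cv=5, method="predict_proba"
            )[:, 1]
            # Refit on the full training split to score unseen samples.
            fitted = make_clf().fit(X_tr[:, cols], y_tr)
            train_cols.append(oof)
            test_cols.append(fitted.predict_proba(X_te[:, cols])[:, 1])
    return np.column_stack(train_cols), np.column_stack(test_cols)


meta_train, meta_test = build_meta_features(X_train, y_train, X_test)

# Meta-learner: a small feed-forward network over the aggregated outputs.
meta_model = MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=2000, random_state=0)
meta_model.fit(meta_train, y_train)

# Final predicted probability of the positive ("allergen") class.
final_prob = meta_model.predict_proba(meta_test)[:, 1]
```

In this sketch each (descriptor, classifier) pair plays the role of one baseline model, so two descriptors and two classifiers yield four meta-feature columns. How the published method selects and combines its baseline models is not specified beyond what the abstract states.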