BACKGROUND: The mammalian immune system is able to generate antibodies against a huge variety of antigens, including bacteria, viruses, and toxins. The ultradeep DNA sequencing of rearranged immunoglobulin genes has considerable potential in furthering our understanding of the immune response, but it is limited by the lack of a high-throughput, sequence-based method for predicting the antigen(s) that a given immunoglobulin recognizes.
OBJECTIVE: As a step toward predicting antibody-antigen binding from sequence data alone, we aimed to compare a range of machine learning approaches applied to a collated data set of antibody-antigen pairs.
METHODS: Data for training and testing were extracted from the Protein Data Bank and the Coronavirus Antibody Database, and additional antibody-antigen pair data were generated by using a molecular docking protocol. Several machine learning methods, including the weighted nearest neighbor method, the nearest neighbor method with the BLOSUM62 matrix, and the random forest method, were applied to the problem.
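To make the nearest neighbor idea concrete, the following is a minimal sketch of one way a BLOSUM62-based nearest neighbor classifier could be set up. It is not the published program. Several choices are assumptions made only for illustration: sequences are treated as pre-aligned and scored position by position without gaps, the similarity of two pairs is the sum of the antibody score and the antigen score, and a single nearest neighbor decides the label. The names `similarity` and `predict_binding` and the toy sequences are hypothetical, and the matrix was transcribed by hand, so it should be checked against an authoritative copy before any real use.

```python
# BLOSUM62 scores for the 20 standard amino acids (hand-transcribed for this sketch).
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
_ROWS = """
 4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
-1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
-2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
-2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
 0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
-1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
-1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
 0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
-2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
-1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
-1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
-1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
-1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
-2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
-1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
 1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
 0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
-3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3
-2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1
 0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
"""

# Build a lookup table keyed by (residue, residue).
BLOSUM62 = {}
for a, row in zip(AMINO_ACIDS, _ROWS.strip().splitlines()):
    for b, score in zip(AMINO_ACIDS, row.split()):
        BLOSUM62[a, b] = int(score)


def similarity(seq_a, seq_b):
    """Ungapped BLOSUM62 score over the positions the two sequences share.

    Assumption: the sequences are already aligned; zip() stops at the
    shorter one, so any trailing residues are ignored.
    """
    return sum(BLOSUM62[a, b] for a, b in zip(seq_a, seq_b))


def predict_binding(query, training_pairs):
    """1-nearest-neighbor prediction for an (antibody, antigen) sequence pair.

    training_pairs is a list of ((antibody_seq, antigen_seq), binds) tuples.
    The label of the highest-scoring training pair is returned.
    """
    def pair_score(example):
        (antibody, antigen), _label = example
        return similarity(query[0], antibody) + similarity(query[1], antigen)

    _pair, label = max(training_pairs, key=pair_score)
    return label


# Toy sequences invented for illustration; they are not from the data set.
training = [
    (("ARNDW", "KLMFP"), True),
    (("GGGGG", "STSTS"), False),
]
print(predict_binding(("ARNDY", "KLMFP"), training))
```

A real pipeline would need a proper alignment step (or a fixed-length encoding of the binding regions) before scoring, and would likely use more than one neighbor; both are left out here to keep the sketch short.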
RESULTS: The final data set contained 1157 antibodies and 57 antigens that were combined in 5041 antibody-antigen pairs. The best performance for the prediction of interactions was obtained by using the nearest neighbor method with the BLOSUM62 matrix, which resulted in around 82% accuracy on the full data set. These results provide a useful frame of reference, as well as protocols and considerations, for machine learning and data set creation in the prediction of antibody-antigen binding.
CONCLUSIONS: Several machine learning approaches for predicting antibody-antigen interactions from protein sequences were compared. Both the data set (in CSV format) and the machine learning program (coded in Python) are freely available for download on GitHub.