Keywords: Compound-protein interactions; Deep learning; Multi-modal; Pre-training

Source: DOI:10.1016/j.compbiolchem.2024.108137

Abstract:
BACKGROUND: Compound-protein interaction (CPI) prediction plays a crucial role in drug discovery and drug repositioning. Early research relied on time-consuming, labor-intensive wet-laboratory experiments, but the advent of deep learning has significantly accelerated the process. Most existing deep learning methods use deep neural networks to extract compound features from sequences and graphs, either separately or in combination. Our team's previous research demonstrated that compound images contain valuable information that can be leveraged for the CPI task. However, few multimodal methods effectively combine the sequence and image representations of compounds for CPI. Contrastive language-image pre-training on text-image pairs is currently a popular approach in the multimodal field. Further research is needed to explore how integrating sequence and image representations can improve the accuracy of CPI prediction.
RESULTS: This paper presents a novel method called MMCL-CPI, which has two key highlights: 1) We propose extracting compound features from two modalities: one-dimensional SMILES strings and two-dimensional images. This captures both sequence and spatial features, improving CPI prediction accuracy. On this basis, we design a novel multimodal model. 2) We introduce a multimodal pre-training strategy that uses contrastive learning on a large-scale unlabeled dataset to establish the correspondence between a compound's SMILES string and its image. This pre-training approach significantly improves compound feature representations for the downstream CPI task. Our method shows competitive results on multiple datasets.
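The pre-training idea described in the RESULTS paragraph, aligning each compound's SMILES embedding with the embedding of its own image, can be illustrated with a CLIP-style symmetric contrastive loss. The sketch below is only an illustration under stated assumptions: the abstract does not give the loss, the encoders, or any hyperparameters, so the function names, the temperature value of 0.07, and the toy two-dimensional embeddings are all hypothetical and stand in for the outputs of real SMILES and image encoders.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(smiles_emb, image_emb, temperature=0.07):
    """CLIP-style loss: pair i of (SMILES, image) is the only positive in row i.

    smiles_emb and image_emb are equal-length lists of embedding vectors;
    the temperature is an illustrative default, not a value from the paper.
    """
    s = [l2_normalize(v) for v in smiles_emb]
    g = [l2_normalize(v) for v in image_emb]
    # Cosine similarity between every SMILES and every image, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(si, gj)) / temperature for gj in g]
        for si in s
    ]
    logits_t = [list(col) for col in zip(*logits)]  # image-to-SMILES direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


# Toy embeddings (hypothetical): two compounds, two dimensions.
smiles_emb = [[1.0, 0.0], [0.0, 1.0]]
aligned_images = [[0.9, 0.1], [0.1, 0.9]]   # image i resembles SMILES i
shuffled_images = [[0.1, 0.9], [0.9, 0.1]]  # pairs deliberately mismatched

loss_aligned = symmetric_contrastive_loss(smiles_emb, aligned_images)
loss_shuffled = symmetric_contrastive_loss(smiles_emb, shuffled_images)
```

With matched pairs the loss is small, and with mismatched pairs it is large, which is the signal that would push the two encoders toward a shared representation during pre-training. A real implementation would compute these embeddings with trainable networks and backpropagate through the loss.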