Keywords: DNA-binding proteins; RNA-binding proteins; protein–DNA binding; protein–RNA binding; protein–nucleic acids binding; transcription factors

MeSH: Amino Acid Sequence; Binding Sites / genetics; Computational Biology / methods; Consensus Sequence; DNA / metabolism; DNA-Binding Proteins / chemistry, genetics, metabolism; Databases, Protein / statistics & numerical data; Humans; Machine Learning; Molecular Sequence Data; Protein Binding / genetics; RNA / metabolism; RNA-Binding Proteins / chemistry, genetics, metabolism; Sequence Homology, Amino Acid

Source: DOI:10.1093/bib/bbv023

Abstract:
Motivated by the pressing need to characterize protein-DNA and protein-RNA interactions on a large scale, we review a comprehensive set of 30 computational methods for high-throughput prediction of RNA- or DNA-binding residues from protein sequences. We summarize these predictors from several significant perspectives, including their design, outputs and availability. We perform an empirical assessment of the methods that offer web servers, using a new benchmark data set characterized by a more complete annotation that includes binding residues transferred from the same or similar proteins. We show that predictors of DNA-binding (RNA-binding) residues offer relatively strong predictive performance, but they are unable to properly separate DNA-binding from RNA-binding residues. We design and empirically assess several types of consensuses and demonstrate that machine learning (ML)-based approaches provide improved predictive performance when compared with the individual predictors of DNA-binding residues or RNA-binding residues. We also formulate and execute a first-of-its-kind study that targets combined prediction of DNA- and RNA-binding residues. We design and test three types of consensuses for this prediction and conclude that this novel approach, which relies on an ML design, provides better predictive quality than individual predictors when tested on prediction of DNA- and RNA-binding residues individually. It also substantially improves discrimination between these two types of nucleic acids. Our results suggest that development of a new generation of predictors would benefit from using training data sets that combine both RNA- and DNA-binding proteins, designing new inputs that specifically target either DNA- or RNA-binding residues, and pursuing combined prediction of DNA- and RNA-binding residues.