Keywords: ProteomeXchange; benchmark; big data; clustering; consensus spectra; mass spectrometry; PRIDE database; spectral libraries

MeSH: Algorithms; Cluster Analysis; Consensus; Databases, Protein; Proteomics / methods; Software; Tandem Mass Spectrometry / methods

Source: DOI:10.1021/acs.jproteome.2c00069

Abstract:
Spectrum clustering is a powerful strategy to reduce redundancy in mass spectrometry data by grouping spectra based on similarity, with the aim of forming groups of mass spectra from the same repeatedly measured analytes. Each such group of near-identical spectra can be represented by its so-called consensus spectrum for downstream processing. Although several algorithms for spectrum clustering have been adequately benchmarked and tested, the influence of the consensus spectrum generation step is rarely evaluated. Here, we present an implementation and benchmark of common consensus spectrum algorithms, including spectrum averaging, spectrum binning, the most similar spectrum, and the best-identified spectrum. We have analyzed diverse public data sets using two different clustering algorithms (spectra-cluster and MaRaCluster) to evaluate how the consensus spectrum generation procedure influences downstream peptide identification. The BEST and BIN methods were found to be the most reliable methods for consensus spectrum generation, including for data sets with post-translational modifications (PTMs) such as phosphorylation. All source code and data of the present study are freely available on GitHub at https://github.com/statisticalbiotechnology/representative-spectra-benchmark.
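To make two of the strategies named in the abstract concrete, the following is a minimal illustrative sketch of a binning-based consensus and a most-similar-spectrum selection for one cluster. It is not the implementation from the linked repository. The function names, the representation of a spectrum as a list of (m/z, intensity) pairs, the bin width, the `min_fraction` peak filter, the averaging over all cluster members, and the use of cosine similarity are all assumptions made for illustration.

```python
from collections import defaultdict
from math import sqrt


def bin_vector(spectrum, bin_width=0.02):
    """Sum peak intensities into fixed-width m/z bins (bin index -> intensity)."""
    binned = defaultdict(float)
    for mz, intensity in spectrum:
        binned[int(mz / bin_width)] += intensity
    return dict(binned)


def bin_consensus(spectra, bin_width=0.02, min_fraction=0.5):
    """Binning-style consensus (assumed variant): keep bins that are populated
    in at least `min_fraction` of the cluster members and report the intensity
    averaged over all members."""
    n = len(spectra)
    totals = defaultdict(float)
    counts = defaultdict(int)
    for spectrum in spectra:
        for idx, intensity in bin_vector(spectrum, bin_width).items():
            totals[idx] += intensity
            counts[idx] += 1
    consensus = []
    for idx in sorted(totals):
        if counts[idx] / n >= min_fraction:
            mz = (idx + 0.5) * bin_width  # report the bin centre as the peak m/z
            consensus.append((mz, totals[idx] / n))
    return consensus


def cosine(a, b):
    """Cosine similarity between two sparse binned spectra."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def most_similar(spectra, bin_width=0.02):
    """Index of the member with the highest mean similarity to all other members."""
    vectors = [bin_vector(s, bin_width) for s in spectra]

    def mean_similarity(i):
        others = [cosine(vectors[i], v) for j, v in enumerate(vectors) if j != i]
        return sum(others) / len(others) if others else 0.0

    return max(range(len(spectra)), key=mean_similarity)
```

The two functions differ in kind: `bin_consensus` synthesizes a new spectrum from all cluster members, whereas `most_similar` selects one existing member as the representative. A best-identified-spectrum strategy would likewise select an existing member, but it needs search-engine scores for each spectrum, so it is not sketched here.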