Keywords: Biological pipeline; Data enhancing; Machine learning; Prostate cancer

MeSH: Male; Humans; Proteomics; Prostate; Prostatic Hyperplasia; Prostatic Neoplasms / diagnosis; Machine Learning; Biomarkers; Peptides

Source: DOI: 10.1186/s12911-024-02491-6

Abstract:
Proteomics-based analysis is used to identify biomarkers in blood samples and tissues. Data produced by instruments such as mass spectrometers require software platforms to identify and quantify proteins (or peptides). Clinical information can be related to mass spectrometry data to identify diseases at an early stage, and machine learning techniques can support physicians and biologists in studying and classifying pathologies. We present a machine learning pipeline for studying and classifying proteomics data enriched with clinical information. The pipeline allows users to relate established blood biomarkers to clinical parameters and proteomics data. It comprises three main phases: (i) feature selection, (ii) model training, and (iii) model ensembling. We report our experience applying the pipeline to prostate-related diseases. Models were trained on several biological datasets; we report experimental results for two datasets that integrate clinical and mass spectrometry-based data from serum and urine analysis. The pipeline takes input data from blood analytes, tissue samples, proteomic analysis, and urine biomarkers, then trains different models for feature selection, classification, and voting. The two datasets were obtained in a two-year research project that aimed to extract hidden information from mass spectrometry, serum, and urine samples from hundreds of patients. We report results for a serum dataset with 143 samples, including 79 prostate cancer (PCa) and 84 benign prostatic hyperplasia (BPH) patients, and a urine dataset with 121 samples, including 67 PCa and 54 BPH patients. The pipeline identified peptides of interest in both datasets: 6 in the first and 2 in the second. The best models for the serum (AUC=0.87, Accuracy=0.83, F1=0.81, Sensitivity=0.84, Specificity=0.81) and urine (AUC=0.88, Accuracy=0.83, F1=0.83, Sensitivity=0.85, Specificity=0.80) datasets showed good predictive performance. The pipeline code is available on GitHub, and we are confident that it will be successfully adopted in similar clinical settings.
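To make the ensembling phase and the reported metrics concrete, here is a minimal sketch in plain Python. It assumes hard (majority) voting over three classifiers; the abstract does not state which voting scheme the pipeline uses. The function names, the model outputs, and the six toy labels are all invented for illustration and are not taken from the study's data or code.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Hard voting: for each sample, keep the label most models predicted.

    With an odd number of models and two classes, ties cannot occur.
    """
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*predictions_per_model)]


def binary_metrics(y_true, y_pred, positive="PCa"):
    """Sensitivity, specificity, accuracy and F1 from confusion counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        "f1": 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0,
    }


# Toy labels (hypothetical, NOT from the paper): PCa is the positive class.
y_true = ["PCa", "PCa", "PCa", "BPH", "BPH", "BPH"]
model_a = ["PCa", "PCa", "BPH", "BPH", "BPH", "PCa"]
model_b = ["PCa", "BPH", "PCa", "BPH", "PCa", "BPH"]
model_c = ["PCa", "PCa", "PCa", "BPH", "BPH", "PCa"]

ensembled = majority_vote([model_a, model_b, model_c])
metrics = binary_metrics(y_true, ensembled)
print(ensembled)
print(metrics)
```

AUC is left out of the sketch because it needs per-sample scores or probabilities rather than hard labels.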