Keywords: FDR; NGS; ORFs; RNA-Seq; false discovery rate; gene annotation; mass spectrometry; novel peptides; proteogenomics; shotgun proteomics; variants

MeSH: Databases, Protein; Nucleotides; Peptides / chemistry; Proteogenomics / methods; Proteome; Proteomics / methods

Source: DOI:10.1093/bib/bbac163

Abstract:
Proteogenomics refers to the integrated analysis of the genome and proteome that leverages mass spectrometry (MS)-based proteomics data to improve genome annotations, understand gene expression control through proteoforms and find sequence variants to develop novel insights for disease classification and therapeutic strategies. However, proteogenomic studies often suffer from reduced sensitivity and specificity due to inflated database size. To control the error rates, proteogenomics depends on the target-decoy search strategy, the de facto method for false discovery rate (FDR) estimation in proteomics. Proteogenomic databases constructed by three- or six-frame translation of nucleotide databases not only increase the search space and compute time but also violate the equivalence of target and decoy databases. These searches result in poorer separation between target and decoy scores, leading to stringent FDR thresholds. Understanding these factors and applying modified strategies such as two-pass database search or peptide-class-specific FDR can result in a better interpretation of MS data without introducing additional statistical biases. Based on these considerations, a user can interpret proteogenomics results appropriately and control false positives and false negatives in a more informed manner. In this review, we first briefly discuss proteogenomic workflows and limitations in database construction, followed by various considerations that can influence potential novel discoveries in a proteogenomic study. We conclude with suggestions to counter these challenges for better proteogenomic data interpretation.
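To make the contrast between a global FDR and a peptide-class-specific FDR concrete, here is a minimal Python sketch. It is not taken from the review: the `PSM` record, its field names, the scores and the threshold are all hypothetical toy values, and the estimator used (decoy hits divided by target hits above a score threshold) is only the simplest common form of target-decoy FDR estimation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class PSM:
    """Hypothetical peptide-spectrum match record (illustrative fields only)."""
    score: float
    is_decoy: bool
    peptide_class: str  # e.g. "known" or "novel" (assumed labels)


def target_decoy_fdr(psms, threshold):
    """Estimate FDR as (#decoy hits) / (#target hits) at or above the threshold."""
    accepted = [p for p in psms if p.score >= threshold]
    decoys = sum(1 for p in accepted if p.is_decoy)
    targets = len(accepted) - decoys
    return decoys / targets if targets else 0.0


def class_specific_fdr(psms, threshold):
    """Apply the same estimator separately within each peptide class."""
    by_class = defaultdict(list)
    for p in psms:
        by_class[p.peptide_class].append(p)
    return {cls: target_decoy_fdr(group, threshold) for cls, group in by_class.items()}


# Toy data, invented for illustration only.
psms = [
    PSM(50, False, "known"), PSM(45, False, "known"),
    PSM(40, False, "known"), PSM(35, False, "known"),
    PSM(20, True, "known"),
    PSM(38, False, "novel"), PSM(36, True, "novel"),
    PSM(32, False, "novel"), PSM(25, True, "novel"),
]

global_fdr = target_decoy_fdr(psms, threshold=30)
per_class = class_specific_fdr(psms, threshold=30)
print(f"global FDR: {global_fdr:.3f}")
print(f"class-specific FDR: {per_class}")
```

In this toy set the global estimate at the chosen threshold is about 0.17, while the novel-peptide class on its own reaches 0.5. A global threshold can hide a much higher error rate among novel identifications, which is the motivation for class-specific FDR named in the abstract.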