A posteriori quality control for the curation and reuse of public proteomics data

Proteomics is a rapidly expanding field encompassing a multitude of complex techniques and data types. To date, much effort has been devoted to achieving the highest possible coverage of proteomes, with the aim of informing future developments in basic biology as well as in clinical settings. As a result, growing amounts of data have been deposited in publicly available proteomics databases. These data are in turn increasingly reused for orthogonal downstream purposes such as data mining and machine learning. These downstream uses, however, need ways to validate a posteriori whether a particular data set is suitable for the envisioned purpose. Furthermore, the (semi‐)automatic curation of repository data depends on analyses that can highlight misannotation and edge conditions in data sets. Such curation is an important prerequisite for efficient reuse of proteomics data in the life sciences in general. We therefore present a selection of quality control metrics and approaches for the a posteriori detection of potential issues encountered in typical proteomics data sets. We illustrate these metrics using publicly available data from the Proteomics Identifications Database (PRIDE), and at the same time show that the large body of PRIDE data is useful for deriving empirical background distributions for the relevant metrics.
