Signatures for Mass Spectrometry Data Quality

Ensuring data quality and proper instrument functionality is a prerequisite for scientific investigation. Manual quality assurance is time-consuming and subjective. Metrics for describing liquid chromatography mass spectrometry (LC–MS) data have been developed; however, the wide variety of LC–MS instruments and configurations precludes applying a simple cutoff. Using 1150 manually classified quality control (QC) data sets, we trained logistic regression classification models to predict whether a data set is in or out of control. Model parameters were optimized by minimizing a loss function that accounts for the trade-off between false positive and false negative errors. The classifier models detected bad data sets with high sensitivity while maintaining high specificity. Moreover, the composite classifier was dramatically more specific than single metrics. Finally, we evaluated the performance of the classifier on a separate validation set where it performed comparably to the results for the testing/training data sets. By presenting the methods and software used to create the classifier, other groups can create a classifier for their specific QC regimen, which is highly variable lab-to-lab. In total, this manuscript presents 3400 LC–MS data sets for the same QC sample (whole cell lysate of Shewanella oneidensis), deposited to the ProteomeXchange with identifiers PXD000320–PXD000324.

[1]  Oliver Stegle,et al.  A Lasso multi-marker mixed model for association mapping with population structure correction , 2013, Bioinform..

[2]  Birgit Schilling,et al.  Interlaboratory Study Characterizing a Yeast Performance Standard for Benchmarking LC-MS Platform Performance* , 2009, Molecular & Cellular Proteomics.

[3]  Jian Huang,et al.  Incorporating group correlations in genome-wide association studies using smoothed group Lasso. , 2013, Biostatistics.

[4]  David L Tabb,et al.  Quality assessment for clinical proteomics. , 2013, Clinical biochemistry.

[5]  Trevor J. Hastie,et al.  Genome-wide association analysis by lasso penalized logistic regression , 2009, Bioinform..

[6]  Joel G. Pounds,et al.  Improved quality control processing of peptide-centric LC-MS proteomics data , 2011, Bioinform..

[7]  John R Yates,et al.  Proteomics of Pyrococcus furiosus (Pfu): Identification of Extracted Proteins by Three Independent Methods. , 2013, Journal of proteome research.

[8]  P. Pevzner,et al.  The Generating Function of CID, ETD, and CID/ETD Pairs of Tandem Mass Spectra: Applications to Database Search* , 2010, Molecular & Cellular Proteomics.

[9]  Lorenzo J. Vega-Montoto,et al.  QuaMeter: multivendor performance metrics for LC-MS/MS proteomics instrumentation. , 2012, Analytical chemistry.

[10]  David L. Tabb,et al.  Performance Metrics for Liquid Chromatography-Tandem Mass Spectrometry Systems in Proteomics Analyses* , 2009, Molecular & Cellular Proteomics.

[11]  Johannes Griss,et al.  The Proteomics Identifications (PRIDE) database and associated tools: status in 2013 , 2012, Nucleic Acids Res..

[12]  David Bramwell An introduction to statistical process control in research proteomics. , 2013, Journal of proteomics.