A review of statistical methods for protein identification using tandem mass spectrometry.

Tandem mass spectrometry has emerged as a powerful tool for characterizing complex protein samples, an increasingly important problem in biology. The effort to perform inference on data from tandem mass spectrometry experiments efficiently and accurately has produced several statistical methods. We describe the predominant methods within a common framework and discuss them in detail. We classify these methods into three categories: set cover methods, iterative methods, and Bayesian methods. For each method, we analyze and evaluate the outcome and methodology of published comparisons to other methods; we use these comparisons to comment on the strengths and weaknesses, as well as the overall utility, of each method. We discuss the similarities between these methods and suggest directions for the field that would unify their shared assumptions in a more rigorous manner and help enable efficient and reliable protein inference.
