A nested mixture model for protein identification using mass spectrometry

Mass spectrometry provides a high-throughput way to identify proteins in biological samples. In a typical experiment, proteins in a sample are first broken into their constituent peptides. The resulting mixture of peptides is then subjected to mass spectrometry, which generates thousands of spectra, each characteristic of its generating peptide. Here we consider the problem of inferring, from these spectra, which proteins and peptides are present in the sample. We develop a statistical approach to the problem, based on a nested mixture model. In contrast to commonly used two-stage approaches, this model provides a one-stage solution that simultaneously identifies which proteins are present and which peptides are correctly identified. In this way our model incorporates the evidence feedback between proteins and their constituent peptides. Using simulated data and a yeast data set, we compare and contrast our method with existing widely used approaches (PeptideProphet/ProteinProphet) and with a recently published approach, HSM. For peptide identification, our single-stage approach yields consistently more accurate results. For protein identification, the methods have similar accuracy in most settings, although we exhibit some scenarios in which the existing methods perform poorly.
