A Partial Set Covering Model for Protein Mixture Identification Using Mass Spectrometry Data

Protein identification is an essential step in mass spectrometry (MS)-based proteome research. Many existing protein identification strategies search sequence databases using either MS data or MS/MS data. MS-based methods provide wider coverage than MS/MS-based methods, but their identification accuracy is lower because MS data carry less information than MS/MS data. More sophisticated algorithms are therefore needed to achieve higher identification accuracy from MS data. Peptide mass fingerprinting (PMF) has been widely used for many years to identify single purified proteins from MS data. In this paper, we extend this technique to protein mixture identification. First, we formulate protein mixture identification as a Partial Set Covering (PSC) problem. Then, we present several algorithms that solve the PSC problem efficiently. Finally, we extend the partial set covering model to MS/MS data and to the combination of MS data and MS/MS data. Experimental results on simulated and real data demonstrate the advantages of our method: 1) it significantly outperforms previous MS-based approaches; 2) it is useful in MS/MS-based protein inference; and 3) it combines MS data and MS/MS data in a unified model, which further improves identification performance.
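To make the partial set covering idea concrete, the sketch below treats each candidate protein as the set of observed peaks it could explain and greedily selects proteins until a required fraction of the peaks is covered. This is the standard greedy heuristic for partial cover, shown only as an illustration; it is not necessarily one of the algorithms presented in the paper. The function name, the protein labels, the peak identifiers, and the `coverage` parameter are all assumptions made for this example.

```python
import math


def greedy_partial_cover(candidates, universe, coverage):
    """Greedy heuristic for partial set cover (illustrative sketch).

    candidates: dict mapping a protein name to the set of peaks it explains
    universe:   set of all observed peaks
    coverage:   fraction of the universe that must be covered, in [0, 1]
    Returns the list of chosen protein names in selection order.
    """
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must lie in [0, 1]")

    # Number of peaks that must be explained; the small epsilon guards
    # against floating-point noise pushing the ceiling up by one.
    required = math.ceil(coverage * len(universe) - 1e-9)

    uncovered = set(universe)
    remaining = dict(candidates)
    chosen = []
    covered_count = 0

    while covered_count < required and remaining:
        # Pick the protein explaining the most still-uncovered peaks;
        # iterating over sorted names makes tie-breaking deterministic.
        best = max(sorted(remaining), key=lambda name: len(remaining[name] & uncovered))
        gain = remaining[best] & uncovered
        if not gain:
            break  # no remaining protein explains any uncovered peak
        chosen.append(best)
        uncovered -= gain
        covered_count += len(gain)
        del remaining[best]

    return chosen


# Hypothetical toy data: peaks are integer labels, proteins are P1..P4.
peaks = {1, 2, 3, 4, 5, 6, 7, 8}
proteins = {
    "P1": {1, 2, 3, 4},
    "P2": {4, 5, 6},
    "P3": {7},
    "P4": {8},
}
print(greedy_partial_cover(proteins, peaks, coverage=0.75))
```

Requiring only partial coverage lets the selection stop before adding proteins that would each explain a single leftover peak, which in this setting may be noise rather than evidence for another protein.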
