Protein inference: a review

Assembling peptides identified from tandem mass spectra into a list of proteins, referred to as protein inference, is a critical step in proteomics research. Because of degenerate peptides (peptides shared by more than one protein) and 'one-hit wonders' (proteins supported by only a single identified peptide), it is often difficult to determine which proteins are actually present in the sample. In this paper, we review existing protein inference methods and classify them by the source of their peptide identifications and by their underlying algorithmic principles. We hope that this review gives readers a clear picture of the current state of the field and helps them develop new protein inference algorithms.
