Computational Methods for Protein Identification from Mass Spectrometry Data

Protein identification using mass spectrometry is an indispensable computational tool in the life sciences. The dramatic increase in the use of proteomic strategies to understand the biology of living systems creates an ongoing need for more effective, efficient, and accurate computational methods for protein identification. A wide range of computational methods, each with various implementations, is available to complement different proteomic approaches. A solid knowledge of the available algorithms and, more critically, of their accuracy and effectiveness is essential to ensure that as many as possible of the proteins in any particular experiment are correctly identified. Here, we undertake a systematic review of the currently available methods and algorithms for interpreting, managing, and analyzing biological data associated with protein identification. We summarize the advances in computational solutions as they have responded to corresponding advances in mass spectrometry hardware. The evolution of scoring algorithms and metrics for automated protein identification is also discussed, with a focus on the relative performance of different techniques. We also consider the relative advantages and limitations of different techniques in particular biological contexts. Finally, we present our perspective on future developments in computational protein identification by considering the most recent literature on new and promising approaches to the problem, identifying areas yet to be explored, and examining the potential application of methods from other areas of computational biology.
