Protein and peptide identification algorithms using MS for use in high‐throughput, automated pipelines

Current proteomics experiments can generate vast quantities of data very quickly, but data analysis capabilities have not kept pace. Although a number of recent reviews cover various aspects of peptide and protein identification by mass spectrometry (MS), comparisons showing which methods are most appropriate for, or most effective at, their proposed tasks are not readily available. As the need for high‐throughput, automated peptide and protein identification systems increases, the creators of such pipelines need to be able to choose algorithms that perform well in terms of both accuracy and computational efficiency. This article therefore reviews the currently available core algorithms for peptide mass fingerprinting (PMF), database searching using tandem MS (MS/MS), sequence tag searches and de novo sequencing. We also assess the relative performance of a number of these algorithms. Because such information is rarely reported in the literature, we conclude that there is a need for a system of standardised reporting on the performance of new peptide and protein identification algorithms, based upon freely available datasets. We go on to present our initial suggestions for the format and content of these datasets.
