High-accuracy peptide mass fingerprinting using peak intensity data with machine learning.

For MALDI-TOF mass spectrometry, we show that the intensity of a peptide-ion peak is directly correlated with its sequence, with the residues M, H, P, R, and L having the most substantial effect on ionization. We developed a machine learning approach, trained on data sets from both true-positive and false-positive PMF searches, that exploits this relationship to significantly improve the accuracy of peptide mass fingerprinting (PMF). The model's cross-validated accuracy in distinguishing true-positive from false-positive database search results is 91%, rivaling the accuracy of MS/MS-based protein identification.
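
To make the sequence-to-intensity idea concrete, the following is a minimal sketch of how a peptide sequence might be turned into numeric inputs for a model of this kind. The abstract does not describe the actual feature encoding, so the function name `residue_features` and the choice of per-residue fractions are assumptions made purely for illustration; only the residue set M, H, P, R, L comes from the text above.

```python
from collections import Counter

# Residues the abstract reports as having the most substantial effect on ionization.
KEY_RESIDUES = ("M", "H", "P", "R", "L")


def residue_features(peptide: str) -> list[float]:
    """Hypothetical encoding: fraction of the peptide made up of each key residue.

    This is an illustrative assumption, not the feature set used in the paper.
    """
    seq = peptide.strip().upper()
    if not seq:
        raise ValueError("empty peptide sequence")
    counts = Counter(seq)  # missing residues count as 0
    return [counts[r] / len(seq) for r in KEY_RESIDUES]
```

A vector like this, computed for every peptide matched in a database search, could be paired with the observed peak intensities and passed to any standard classifier; which learner and which features the authors actually used is not stated here.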
