Feature-matching Pattern-based Support Vector Machines for Robust Peptide Mass Fingerprinting

Peptide mass fingerprinting (PMF), despite having become complementary to tandem mass spectrometry for protein identification, is still the subject of in-depth study because of its higher sample throughput, higher specificity for single peptides and lower sensitivity to unexpected post-translational modifications compared with tandem mass spectrometry. In this study, we propose, implement and evaluate a unified approach that uses support vector machines to incorporate individual concepts and conclusions for accurate PMF. We focus on the inherent attributes and critical issues of the theoretical spectrum (peptides), the experimental spectrum (peaks) and the alignment of the two spectra (masses). Eighty-one feature-matching patterns, derived from the cleavage type, uniqueness and variable masses of theoretical peptides together with the intensity rank of experimental peaks, were proposed to characterize the matching profile of the PMF procedure. We developed a new strategy that redistributes matched peak intensities to handle shared peak intensities, and generated 440 parameters to quantify each feature-matching pattern. The optimal multi-criteria support vector machine approach achieved high performance on an evaluation data set of 137 items, using 491 final features selected from a feature vector of 35,640 normalized features through cross-training and validation on a publicly available “gold standard” PMF data set of 1733 items. Compared with the Mascot, MS-Fit, ProFound and Aldente algorithms commonly used for MS-based protein identification, the feature-matching patterns algorithm separates correct identifications from random matches more clearly, with the highest values for sensitivity (82%), precision (97%) and F1-measure (89%) of protein identification. Several conclusions reached in this research make general contributions to MS-based protein identification.
First, inherent attributes showed robustness comparable to, or even greater than, that of explicit attributes. As an inherent attribute of an experimental spectrum, peak intensity should receive considerable attention during protein identification. Second, alignment between intense experimental peaks and properly digested, unique or non-modified theoretical peptides is very likely to occur in positive PMF identifications. Finally, normalization by several types of harmonic factors, including missed cleavages and mass modifications, can contribute substantially to the performance of the procedure.
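
The figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below is not the authors' code. It assumes that the 35,640-element feature vector is the product of the 81 patterns and the 440 parameters per pattern (the abstract does not say so explicitly, although the numbers agree), and that the F1-measure is the usual harmonic mean of precision and sensitivity. The function names are hypothetical.

```python
# Figures quoted in the abstract.
N_PATTERNS = 81            # feature-matching patterns
PARAMS_PER_PATTERN = 440   # parameters generated per pattern
SENSITIVITY = 0.82         # recall of protein identification
PRECISION = 0.97


def feature_vector_length(n_patterns: int, params_per_pattern: int) -> int:
    """Length of the feature vector, assuming one slot per (pattern, parameter) pair."""
    return n_patterns * params_per_pattern


def f1_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(feature_vector_length(N_PATTERNS, PARAMS_PER_PATTERN))  # 35640
    print(round(f1_measure(PRECISION, SENSITIVITY), 2))           # 0.89
```

Under these assumptions both derived values agree with the reported 35,640 features and 89% F1-measure.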
