Genetic Programming for Preprocessing Tandem Mass Spectra to Improve the Reliability of Peptide Identification

Tandem mass spectrometry (MS/MS) is currently the most commonly used technology in proteomics for identifying proteins in complex biological samples. Mass spectrometers can produce a large number of MS/MS spectra, each of which has hundreds of peaks. These peaks normally include background noise, so a preprocessing step that filters out the noise peaks can improve the accuracy and reliability of peptide identification. This paper proposes to preprocess the data by classifying each peak as either a noise peak or a signal peak, which is a highly imbalanced binary classification task, and uses genetic programming (GP) to address this task. The aim is to increase the reliability of peptide identification. Six other types of classification algorithms are also applied at various imbalance ratios, and all methods are evaluated in terms of average accuracy and recall. On a benchmark dataset containing 1,674 MS/MS spectra, the GP method appears to be the best at retaining signal peaks. To further evaluate the effectiveness of the GP method, the preprocessed spectral data is submitted to a benchmark de novo sequencing software, PEAKS, to identify the peptides. The results show that the proposed method improves the reliability of peptide identification compared with both the original data without preprocessing and the intensity-based thresholding methods.
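
To make the baseline and the evaluation measures concrete, the following is a minimal Python sketch of an intensity-based thresholding filter scored by recall and average accuracy. It is not the paper's implementation: the function names, the 5% relative-intensity cutoff, the toy spectrum, and its signal/noise labels are all hypothetical, and "average accuracy" is assumed here to mean the mean of the per-class accuracies (signal recall and noise recall), which the abstract does not define.

```python
from typing import List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def threshold_filter(peaks: List[Peak], fraction: float = 0.05) -> List[bool]:
    """Mark a peak as signal (True) if its intensity is at least
    `fraction` of the most intense peak in the spectrum.
    The 5% default is an illustrative assumption, not a value from the paper."""
    if not peaks:
        return []
    base = max(intensity for _, intensity in peaks)
    return [intensity >= fraction * base for _, intensity in peaks]


def recall_and_average_accuracy(
    predicted: List[bool], actual: List[bool]
) -> Tuple[float, float]:
    """Return (recall on signal peaks, mean of signal and noise recall).
    Averaging the per-class accuracies is an assumed reading of
    'average accuracy' for imbalanced data."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    positives = sum(1 for a in actual if a)
    negatives = len(actual) - positives
    recall = tp / positives if positives else 0.0
    noise_recall = tn / negatives if negatives else 0.0
    return recall, (recall + noise_recall) / 2


# Hypothetical toy spectrum: (m/z, intensity) pairs with made-up labels.
spectrum = [
    (100.0, 1000.0),
    (150.0, 20.0),
    (200.0, 400.0),
    (250.0, 30.0),   # labeled signal, but below the cutoff
    (300.0, 60.0),   # labeled noise, but above the cutoff
    (350.0, 10.0),
]
is_signal = [True, False, True, True, False, False]

kept = threshold_filter(spectrum, fraction=0.05)
recall, average_accuracy = recall_and_average_accuracy(kept, is_signal)
print(f"recall={recall:.3f} average_accuracy={average_accuracy:.3f}")
```

In this toy case the low-intensity signal peak is discarded and an intense noise peak is kept, which is the kind of error a fixed intensity cutoff can make and a learned peak classifier is meant to reduce.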
