Genetic Programming for Measuring Peptide Detectability

The biomarker discovery process usually produces a long list of candidates that need to be verified. Protein biomarkers identified from mass spectrometry data can be verified by measuring the probability of detection by the mass spectrometer (peptide detection). However, the limited size of the experimental data and the lack of a universal quantitative method make the identification of these peptides challenging. In this paper, genetic programming (GP) is proposed to measure the detectability of peptides in the mass spectrometer. This is done by measuring the physicochemical properties of the peptides and selecting the high-responding peptides. The proposed method performs both feature selection and classification, where feature selection determines which physicochemical properties are important for the prediction. The proposed GP method is tested on two yeast data sets of increasing complexity, where it outperforms five other state-of-the-art classification algorithms. The results also show that GP outperforms two conventional feature selection methods, namely Chi Square and Information Gain Ratio.
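To illustrate how a single GP individual can perform feature selection and classification at once, the following is a minimal sketch and not the paper's implementation. It assumes a tree built from arithmetic operators, a decision threshold at zero, and hypothetical feature names (`hydrophobicity`, `charge`, `length`, `mass`); none of these details are stated in the abstract. The features that appear as leaves of the tree are the selected ones, and the sign of the tree's output gives the class.

```python
import operator


def protected_div(a, b):
    """Division that returns 1.0 when the denominator is (near) zero."""
    return a / b if abs(b) > 1e-9 else 1.0


# Assumed function set: the four arithmetic operators.
FUNCTIONS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": protected_div,
}


def evaluate(tree, features):
    """Recursively evaluate a GP tree on one peptide's feature dictionary.

    A tree is a feature name (str), a numeric constant,
    or a tuple (operator, left_subtree, right_subtree).
    """
    if isinstance(tree, str):
        return features[tree]
    if isinstance(tree, (int, float)):
        return float(tree)
    op, left, right = tree
    return FUNCTIONS[op](evaluate(left, features), evaluate(right, features))


def classify(tree, features):
    """Assumed decision rule: non-negative output means 'detected'."""
    return "detected" if evaluate(tree, features) >= 0 else "not detected"


def selected_features(tree):
    """Return the set of feature names that appear as leaves of the tree."""
    if isinstance(tree, str):
        return {tree}
    if isinstance(tree, (int, float)):
        return set()
    _, left, right = tree
    return selected_features(left) | selected_features(right)


# Hypothetical evolved individual: (hydrophobicity * charge) - (length + 2.0)
tree = ("-", ("*", "hydrophobicity", "charge"), ("+", "length", 2.0))

# Hypothetical peptide; 'mass' is available but unused by the tree.
peptide = {"hydrophobicity": 3.0, "charge": 2.0, "length": 10.0, "mass": 1200.0}

print(classify(tree, peptide))
print(sorted(selected_features(tree)))
```

In this sketch `mass` is never referenced by the tree, so it is implicitly discarded. That is the sense in which feature selection is embedded in the evolved classifier rather than run as a separate filtering step such as Chi Square or Information Gain Ratio.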
