A new GP-based wrapper feature construction approach to classification and biomarker identification

Mass spectrometry (MS) is a technology used for the identification and quantification of proteins and metabolites. It supports the discovery of proteomic and metabolomic biomarkers, which aid in disease detection and drug discovery. Biomarkers are detected by classifying samples as diseased or healthy. A mass spectrometer produces high-dimensional data in which most features are irrelevant for classification, so feature reduction is needed before MS data can be classified effectively. Feature construction provides a means of dimensionality reduction and aims to improve classification performance. In this paper, genetic programming (GP) is used to construct multiple features, and two methods are proposed for this purpose. Both methods wrap a Random Forest (RF) classifier inside GP to ensure the quality of the constructed features. In addition, five other classifiers besides RF are used to test the effect of the constructed features on classifier performance. The results show that the proposed GP methods improve classification performance over the original feature set on five MS data sets.
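The wrapper idea described above can be illustrated with a minimal sketch. It is not the paper's implementation: the function set, the mutation-only evolution loop, the even/odd hold-out split, and every name below are assumptions made for illustration, and a nearest-centroid classifier stands in for the wrapped RF so that the example needs only the Python standard library.

```python
import operator
import random

# Function set for the expression trees (an assumption; the abstract does not list one).
OPS = [operator.add, operator.sub, operator.mul]


def random_tree(n_features, depth, rng):
    """Grow a random expression tree whose leaves are original feature indices."""
    if depth == 0 or rng.random() < 0.3:
        return rng.randrange(n_features)
    return (rng.choice(OPS),
            random_tree(n_features, depth - 1, rng),
            random_tree(n_features, depth - 1, rng))


def evaluate(tree, row):
    """Compute the value of one constructed feature for one sample."""
    if isinstance(tree, int):
        return row[tree]
    op, left, right = tree
    return op(evaluate(left, row), evaluate(right, row))


def transform(trees, X):
    """Map each sample to its vector of constructed features."""
    return [[evaluate(t, row) for t in trees] for row in X]


def centroid_accuracy(train_X, train_y, test_X, test_y):
    """Stand-in for the wrapped classifier: nearest class centroid (not RF)."""
    centroids = {}
    for label in set(train_y):
        rows = [r for r, y in zip(train_X, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def predict(row):
        return min(centroids,
                   key=lambda c: sum((a - b) ** 2
                                     for a, b in zip(row, centroids[c])))

    hits = sum(predict(r) == y for r, y in zip(test_X, test_y))
    return hits / len(test_y)


def fitness(trees, X, y):
    """Wrapper fitness: hold-out accuracy of the classifier on constructed features."""
    Z = transform(trees, X)
    # Even-indexed samples train the classifier, odd-indexed samples validate it.
    return centroid_accuracy(Z[0::2], y[0::2], Z[1::2], y[1::2])


def mutate(trees, n_features, rng):
    """Replace one constructed feature with a fresh random tree."""
    child = list(trees)
    child[rng.randrange(len(child))] = random_tree(n_features, 2, rng)
    return child


def evolve(X, y, n_constructed=2, pop_size=20, generations=15, seed=0):
    """Evolve a set of constructed features; each individual is a list of trees."""
    rng = random.Random(seed)
    n_features = len(X[0])
    pop = [[random_tree(n_features, 2, rng) for _ in range(n_constructed)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda ind: fitness(ind, X, y), reverse=True)
        elite = pop[: pop_size // 2]  # the better half survives unchanged
        pop = elite + [mutate(rng.choice(elite), n_features, rng)
                       for _ in range(pop_size - len(elite))]
    return max(pop, key=lambda ind: fitness(ind, X, y))
```

Calling `evolve(X, y)` on a list of numeric feature rows `X` and class labels `y` returns a list of expression trees, and `transform` then maps any sample matrix into the lower-dimensional constructed feature space, which other classifiers can be trained on. To move closer to the approach in the abstract, `centroid_accuracy` would be replaced by a Random Forest evaluated with cross-validation, and crossover would be added alongside mutation.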
