Genetic Programming Representations for Multi-dimensional Feature Learning in Biomedical Classification

We present a new classification method that uses genetic programming (GP) to evolve feature transformations for a deterministic, distance-based classifier. This method, called M4GP, differs from common approaches to classifier representation in GP in two ways: it does not enforce arbitrary decision boundaries, and it allows individuals to produce multiple outputs via a stack-based GP system. Compared with typical classification methods, M4GP can be advantageous in its ability to produce readable models. We conduct a comprehensive study of M4GP, comparing it first to other GP classifiers and then to six common machine learning classifiers. We conduct full hyper-parameter optimization for all of the methods on a suite of 16 biomedical data sets that range in size and difficulty. The results indicate that M4GP outperforms the other GP methods for classification and that, on most problems, it is competitive with the other machine learning methods in terms of the accuracy of the produced models. M4GP also exhibits a better ability to detect epistatic interactions than the other methods.
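The core idea, a stack-based program whose leftover stack entries become the features fed to a distance-based classifier, can be illustrated with a minimal sketch. This is not the M4GP implementation: the postfix token format, the operator set, the function names, and the use of a Euclidean nearest-centroid rule are all assumptions made for illustration, and the program shown is written by hand rather than evolved.

```python
import numpy as np

# Hypothetical operator set; the abstract does not specify one.
BINARY_OPS = {"+": np.add, "-": np.subtract, "*": np.multiply}

def eval_stack_program(program, X):
    """Evaluate a postfix program; each array left on the stack is one feature."""
    stack = []
    for token in program:
        if token in BINARY_OPS:
            if len(stack) < 2:
                continue  # skip an operator that lacks two arguments
            b = stack.pop()
            a = stack.pop()
            stack.append(BINARY_OPS[token](a, b))
        else:
            # Any other token is assumed to be an input column index.
            stack.append(X[:, token].astype(float))
    return np.column_stack(stack)  # shape: (n_samples, n_outputs)

def fit_centroids(Z, y):
    """Compute one centroid per class in the transformed feature space."""
    classes = np.unique(y)
    centroids = np.stack([Z[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict(Z, classes, centroids):
    """Assign each sample to the class with the nearest (Euclidean) centroid."""
    d = np.linalg.norm(Z[:, None, :] - centroids[None, :, :], axis=2)
    return classes[d.argmin(axis=1)]

# Toy interaction problem: the label depends only on the product x0 * x1.
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]])
y = np.array([1, 0, 0, 1])

# Hand-written program: pushes x0 and x1, multiplies them, then pushes x0.
# Two values remain on the stack, so the program yields two features.
program = [0, 1, "*", 0]
Z = eval_stack_program(program, X)
classes, centroids = fit_centroids(Z, y)
pred = predict(Z, classes, centroids)
```

In this toy case the raw inputs are not separable by a nearest-centroid rule, because both class centroids coincide at the origin, while the transformed feature `x0 * x1` separates the classes. In a GP setting, the accuracy of such a classifier on the transformed features would plausibly serve as the fitness of the program.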
