Gene selection by sequential search wrapper approaches in microarray cancer class prediction

In recent years, gene expression profiling technologies have grown rapidly and are expected to provide insight into cancer-related cellular processes. Machine Learning algorithms, although applied extensively in many real-world domains, are still not widely used in the Bioinformatics community. We report on the successful application of four well-known supervised Machine Learning methods (IB1, Naive-Bayes, C4.5 and CN2) to cancer class prediction problems in three DNA microarray datasets of very high dimensionality (Colon, Leukemia and NCI-60). Gene selection, an essential step in microarray domains, is performed by a sequential search engine that evaluates the goodness of each gene subset with a wrapper approach: the supervised algorithm is run under leave-one-out cross-validation to estimate its accuracy on that subset. With this gene selection procedure, the accuracy of the supervised algorithms improves significantly and the number of genes in the classification models is notably reduced for all datasets.
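
The wrapper scheme described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a forward (additive) sequential search, uses a plain 1-nearest-neighbour rule with squared Euclidean distance as a stand-in for IB1, and stops as soon as no candidate gene improves the leave-one-out accuracy. The function names and the toy expression matrix are hypothetical and serve only as illustration.

```python
from typing import List, Optional, Sequence, Tuple


def loo_accuracy_1nn(X: Sequence[Sequence[float]], y: Sequence[int],
                     genes: Sequence[int]) -> float:
    """Leave-one-out accuracy of a 1-NN classifier that sees only `genes`."""
    n = len(X)
    correct = 0
    for i in range(n):
        best_dist, best_label = float("inf"), None
        for j in range(n):
            if j == i:  # hold sample i out
                continue
            dist = sum((X[i][g] - X[j][g]) ** 2 for g in genes)
            if dist < best_dist:
                best_dist, best_label = dist, y[j]
        correct += int(best_label == y[i])
    return correct / n


def sequential_forward_selection(X: Sequence[Sequence[float]], y: Sequence[int],
                                 max_genes: Optional[int] = None
                                 ) -> Tuple[List[int], float]:
    """Greedily add the gene whose inclusion gives the best wrapper score."""
    n_genes = len(X[0])
    limit = max_genes if max_genes is not None else n_genes
    selected: List[int] = []
    best_acc = 0.0
    while len(selected) < limit:
        candidates = [g for g in range(n_genes) if g not in selected]
        # Wrapper step: score every candidate subset by running the classifier.
        scored = [(loo_accuracy_1nn(X, y, selected + [g]), g) for g in candidates]
        # Highest accuracy wins; ties go to the lowest gene index.
        acc, gene = max(scored, key=lambda t: (t[0], -t[1]))
        if acc <= best_acc:  # assumed stopping rule: no strict improvement
            break
        selected.append(gene)
        best_acc = acc
    return selected, best_acc


# Hypothetical toy data: 6 samples x 4 genes; only gene 2 separates the classes.
X = [
    [0.9, 0.1, 1.0, 0.5],
    [0.2, 0.8, 1.2, 0.4],
    [0.5, 0.5, 0.9, 0.9],
    [0.8, 0.2, 3.0, 0.5],
    [0.1, 0.9, 3.1, 0.3],
    [0.6, 0.4, 2.9, 0.8],
]
y = [0, 0, 0, 1, 1, 1]

selected, best_acc = sequential_forward_selection(X, y)
print("selected genes:", selected, "LOO accuracy:", best_acc)
```

Each candidate subset costs one full leave-one-out run, so the search cost grows with both the number of genes and the number of samples. On real microarray data with thousands of genes, this is the main practical constraint of a wrapper approach.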
