Building Classification Models from Microarray Data with Tree-Based Classification Algorithms

Building classification models plays an important role in DNA microarray data analysis. An essential feature of DNA microarray data sets is that the number of input variables (genes) is far greater than the number of samples. As a result, most classification schemes apply variable selection or feature selection methods to pre-process the data. This paper investigates various aspects of building classification models from microarray data with tree-based classification algorithms, using Partial Least-Squares (PLS) regression as the feature selection method. Experimental results show that PLS regression is an appropriate feature selection method for this setting and that tree-based ensemble models can deliver high-performance classifiers for microarray data.
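The abstract does not spell out how PLS is applied, so the following is only a minimal sketch of one common reading, not the paper's procedure. It assumes that genes are ranked by the absolute weight they receive in the first PLS component (which, for a single numeric response, is proportional to the covariance between each centered gene and the centered class label), and that the top k genes are kept. The function names, the toy data, and the choice of k are all hypothetical.

```python
from statistics import mean


def pls_first_component_weights(X, y):
    """Unit-norm weight vector of the first PLS component (single response).

    X: list of samples, each a list of gene expression values.
    y: list of numeric class labels (e.g. 0/1).
    """
    n_genes = len(X[0])
    y_mean = mean(y)
    col_means = [mean(row[j] for row in X) for j in range(n_genes)]
    # Each weight is proportional to the covariance of gene j with the label.
    w = [
        sum((row[j] - col_means[j]) * (label - y_mean) for row, label in zip(X, y))
        for j in range(n_genes)
    ]
    norm = sum(v * v for v in w) ** 0.5
    return [v / norm for v in w] if norm > 0 else w


def select_top_genes(X, y, k):
    """Indices of the k genes with the largest absolute PLS weight (assumed criterion)."""
    w = pls_first_component_weights(X, y)
    ranked = sorted(range(len(w)), key=lambda j: abs(w[j]), reverse=True)
    return sorted(ranked[:k])


# Hypothetical toy data: 4 samples, 4 genes; only gene 0 separates the classes.
X = [
    [5.1, 0.2, 1.0, 3.3],
    [4.9, 0.1, 1.1, 3.1],
    [1.0, 0.3, 1.0, 3.2],
    [0.8, 0.2, 0.9, 3.4],
]
y = [1, 1, 0, 0]

selected = select_top_genes(X, y, k=1)
X_reduced = [[row[j] for j in selected] for row in X]
```

In a full pipeline, `X_reduced` would then be passed to a tree-based learner such as a single decision tree, a bagged or boosted ensemble, or a random forest. To avoid optimistic accuracy estimates, the selection step would normally be refit inside each cross-validation fold rather than once on all samples.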
