Decision forest for classification of gene expression data

This study proposes an improved decision forest (IDF) with an integrated graphical user interface. On four gene expression data sets, the IDF outperforms the original decision forest and is superior or comparable to other state-of-the-art machine learning methods, especially on high-dimensional data. Because it has a built-in feature selection (FS) mechanism and fewer parameters to tune, it can be trained more efficiently than methods such as support vector machines, and it can be built with far fewer trees than other popular tree-based ensemble methods. It also suffers less from the curse of dimensionality.
