Building Diversified Multiple Trees for Classification in High Dimensional Noise Data

It is common for a trained classification model to be applied to operating data that deviates from the training data because of noise. This paper demonstrates that an ensemble classifier, Diversified Multiple Trees (DMT), is more robust for classifying noisy data than other widely used ensemble methods. DMT is tested on three real-world biological data sets from different laboratories, in comparison with four benchmark ensemble classifiers. Experimental results show that DMT is significantly more accurate than the benchmark ensemble classifiers on noisy test data. We also discuss a limitation of DMT and its possible variations.
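The evaluation setting the abstract describes, training on clean data and testing on a noise-perturbed copy, can be sketched as follows. This is a minimal illustration assuming scikit-learn and synthetic data; a random forest stands in for an ensemble classifier here, and the authors' DMT algorithm is not reproduced:

```python
# Sketch of the noisy-test-data setting: train on clean data, evaluate on a
# perturbed copy. Uses a RandomForestClassifier as a generic tree ensemble;
# this is NOT the paper's DMT algorithm, only an illustration of the setup.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# High-dimensional synthetic data, loosely mimicking a gene-expression task.
X, y = make_classification(n_samples=400, n_features=50, n_informative=10,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Simulate operating data that has drifted from the training distribution
# by adding Gaussian noise to the test features only.
X_te_noisy = X_te + rng.normal(scale=1.0, size=X_te.shape)

single_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

acc_tree = single_tree.score(X_te_noisy, y_te)
acc_forest = forest.score(X_te_noisy, y_te)
print(f"single tree on noisy test data: {acc_tree:.3f}")
print(f"ensemble on noisy test data:    {acc_forest:.3f}")
```

The noise scale and data dimensions above are arbitrary choices for illustration; the paper's experiments instead use real biological data sets collected in different laboratories as the source of distribution shift.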
