Classification by ensembles from random partitions of high-dimensional data

A robust classification procedure is developed based on ensembles of classifiers, in which each classifier is built from a different subset of predictors determined by a random partition of the full predictor set. The proposed methods combine the results of multiple classifiers to achieve substantially better prediction than the optimal single classifier. The approach is designed specifically for high-dimensional data sets. Because each classifier is built on only one subspace of the predictors, the proposed methods gain a computational advantage in tackling the growing problem of dimensionality. For each subspace of the predictors, we build a classification tree or a logistic regression tree. Using four real data sets from different areas, our study shows that the methods perform consistently well compared with widely used classification methods. For unbalanced data, our approach maintains the balance between sensitivity and specificity better than many of the other classification methods considered in this study.
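The core idea, one classifier per block of a random partition of the predictors with the results then combined, can be illustrated with a minimal sketch. This is not the authors' implementation. The function names (`random_partition`, `fit_stump`, `fit_partition_ensemble`, `predict_majority`), the synthetic data, the use of a depth-1 tree (a stump) as a stand-in for a full classification tree, and the unweighted majority vote as the combining rule are all assumptions made for illustration.

```python
import numpy as np

def random_partition(n_features, n_subsets, rng):
    """Split feature indices into disjoint, near-equal random subsets."""
    return np.array_split(rng.permutation(n_features), n_subsets)

def fit_stump(X, y):
    """Depth-1 classification tree (assumed stand-in for a full tree):
    pick the single (feature, threshold) split with the best training accuracy."""
    best = (-1.0, 0, 0.0, 0, 1)  # (accuracy, feature, threshold, left label, right label)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            for lo, hi in ((0, 1), (1, 0)):
                acc = np.mean(np.where(left, lo, hi) == y)
                if acc > best[0]:
                    best = (acc, j, t, lo, hi)
    _, j, t, lo, hi = best
    return lambda Z: np.where(Z[:, j] <= t, lo, hi)

def fit_partition_ensemble(X, y, n_subsets, seed=0):
    """Fit one base classifier on each block of a random partition of the predictors."""
    rng = np.random.default_rng(seed)
    subsets = random_partition(X.shape[1], n_subsets, rng)
    return [(cols, fit_stump(X[:, cols], y)) for cols in subsets]

def predict_majority(ensemble, X):
    """Combine the base classifiers by unweighted majority vote (assumed rule)."""
    votes = np.stack([clf(X[:, cols]) for cols, clf in ensemble])
    return (votes.mean(axis=0) > 0.5).astype(int)

# Synthetic two-class data (hypothetical): 200 samples, 30 predictors,
# with class 1 shifted by 1.5 in every predictor.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=200)
X = rng.normal(size=(200, 30)) + 1.5 * y[:, None]

ensemble = fit_partition_ensemble(X, y, n_subsets=5)  # odd count avoids tied votes
pred = predict_majority(ensemble, X)
train_accuracy = float(np.mean(pred == y))
```

Because the subsets are disjoint, each base classifier sees only a fraction of the predictors, which is where the computational saving on high-dimensional data comes from. The accuracy computed here is on the training data only and says nothing about out-of-sample performance.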
