Search for the smallest random forest.

Random forests have emerged as one of the most commonly used nonparametric statistical methods in many scientific areas, particularly in the analysis of high-throughput genomic data. A general practice in using random forests is to generate a sufficiently large number of trees, although how large is sufficient remains subjective. Furthermore, random forests are viewed as "black boxes" because of their sheer size. In this work, we address a fundamental issue in the use of random forests: how large does a random forest have to be? To this end, we propose a specific method to find a sub-forest (e.g., with a single-digit number of trees) that can achieve the prediction accuracy of a large random forest (on the order of thousands of trees). We tested it in extensive simulation studies and in a real study on the prognosis of breast cancer. The results show that such sub-forests usually exist and that most of them are very small, suggesting that they are actually the "representatives" of the whole random forests. We conclude that the sub-forests are indeed the core of a random forest. Thus, it is not necessary to use the whole forest to achieve satisfactory prediction performance. Also, once a random forest is reduced to a manageable size, it is no longer a black box.
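The idea of searching for a small sub-forest can be illustrated with a minimal sketch. The abstract does not spell out the search procedure, so the greedy backward elimination below is an assumption made purely for illustration and should not be read as the authors' method. The function names (`majority_vote`, `accuracy`, `shrink_forest`) and the `tolerance` parameter are hypothetical. The input `tree_preds` is assumed to hold each tree's predicted labels on held-out samples, one list per tree.

```python
from collections import Counter


def majority_vote(tree_preds, members):
    """Majority-vote label per sample, using only the trees listed in `members`."""
    n_samples = len(tree_preds[0])
    votes = []
    for i in range(n_samples):
        counts = Counter(tree_preds[t][i] for t in members)
        top = max(counts.values())
        # Break ties deterministically by taking the smallest label.
        votes.append(min(label for label, c in counts.items() if c == top))
    return votes


def accuracy(pred, truth):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(p == y for p, y in zip(pred, truth)) / len(truth)


def shrink_forest(tree_preds, truth, tolerance=0.0):
    """Greedy backward elimination (an illustrative assumption, not the paper's method).

    Repeatedly drops the tree whose removal leaves the highest accuracy, and
    stops when every possible removal would push accuracy below the
    full-forest accuracy minus `tolerance`.
    """
    members = list(range(len(tree_preds)))
    target = accuracy(majority_vote(tree_preds, members), truth) - tolerance
    while len(members) > 1:
        best_acc, best_drop = -1.0, None
        for t in members:
            rest = [m for m in members if m != t]
            acc = accuracy(majority_vote(tree_preds, rest), truth)
            if acc > best_acc:
                best_acc, best_drop = acc, t
        if best_acc < target:
            break
        members.remove(best_drop)
    return members


# Toy example: five "trees", six held-out samples with binary labels.
truth = [0, 1, 0, 1, 1, 0]
tree_preds = [
    [0, 1, 0, 1, 1, 0],
    [0, 1, 0, 1, 0, 0],
    [1, 1, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 1, 1, 1, 1, 0],
]
sub = shrink_forest(tree_preds, truth)
print(sub, accuracy(majority_vote(tree_preds, sub), truth))
```

In practice the per-tree predictions would come from a fitted forest evaluated on data not used for training, since selecting and scoring a sub-forest on the same samples tends to overstate its accuracy.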
