Ensemble of optimal trees, random forest and random projection ensemble classification

The predictive performance of a random forest ensemble is closely tied to the strength of its individual trees and the diversity among them. An ensemble built from a small number of accurate and diverse trees, provided prediction accuracy is not compromised, also reduces the computational burden. We investigate the idea of integrating trees that are both accurate and diverse. To this end, we use the out-of-bag observations of the training bootstrap samples as a validation sample to choose the best trees based on their individual performance, and then assess these trees for diversity using the Brier score on an independent validation sample. Starting from the single best tree, a tree is selected for the final ensemble only if adding it to the forest reduces the error of the trees already included. Unlike random projection ensemble classification, our approach does not apply an implicit dimension reduction for each tree. A total of 35 benchmark classification and regression problems are used to assess the performance of the proposed method and to compare it with random forest, random projection ensemble classification, node harvest, support vector machines, kNN, and classification and regression trees. We compute unexplained variances or classification error rates for all the methods on the corresponding data sets. Our experiments reveal that the size of the ensemble is reduced significantly while better results are obtained in most cases. Results of a simulation study are also given, in which four tree-style scenarios are considered to generate data sets with several structures.
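To make the selection procedure concrete, below is a minimal Python sketch of the greedy idea described above, using scikit-learn. The function name `select_optimal_trees` is hypothetical, and for simplicity a single held-out sample is used both to rank the trees and to run the Brier-score check; the method described in the abstract instead ranks trees on their out-of-bag observations and assesses diversity on a separate independent validation sample.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss


def select_optimal_trees(X_train, y_train, X_val, y_val, n_trees=500):
    """Greedy selection of accurate, diverse trees (illustrative sketch).

    Assumes a binary classification problem with labels coded 0/1.
    Trees are ranked by individual accuracy on a held-out sample (a
    stand-in for the out-of-bag ranking), then added to the ensemble
    only if they lower the Brier score of the combined prediction.
    """
    forest = RandomForestClassifier(n_estimators=n_trees).fit(X_train, y_train)

    # Rank the fitted trees by their individual accuracy.
    accs = [tree.score(X_val, y_val) for tree in forest.estimators_]
    order = np.argsort(accs)[::-1]

    # Start the final ensemble with the single best tree.
    selected = [forest.estimators_[order[0]]]
    prob_sum = selected[0].predict_proba(X_val)[:, 1]
    best_brier = brier_score_loss(y_val, prob_sum)

    for idx in order[1:]:
        tree = forest.estimators_[idx]
        cand_sum = prob_sum + tree.predict_proba(X_val)[:, 1]
        cand_brier = brier_score_loss(y_val, cand_sum / (len(selected) + 1))
        # Keep the tree only if it reduces the ensemble's Brier score.
        if cand_brier < best_brier:
            selected.append(tree)
            prob_sum, best_brier = cand_sum, cand_brier
    return selected
```

Averaging class-membership probabilities rather than taking a majority vote fits naturally with the Brier score, which is a proper scoring rule for probability estimates; the same greedy loop applies to regression trees with squared error in place of the Brier score.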
