Modifications of the construction and voting mechanisms of the Random Forests Algorithm

The aim of this work is to propose modifications of the Random Forests algorithm that improve its prediction performance. The suggested modifications aim to increase the strength and decrease the correlation of the individual trees of the forest, and to improve the function that combines the outputs of the base classifiers. This is achieved by modifying the node-splitting and voting procedures. For node splitting, different choices of the number of candidate predictors and of the impurity measure used to evaluate splits are examined. For the voting procedure, modifications based on feature selection, clustering, nearest neighbors, and optimization techniques are proposed. The novelty of this work is that it proposes modifications not only to the construction mechanism or the voting mechanism in isolation but, for the first time, examines improving the Random Forests algorithm as a whole (a combination of construction and voting). We evaluate the proposed modifications on 24 datasets. The evaluation demonstrates that the proposed modifications have a positive effect on the performance of the Random Forests algorithm and provide results comparable to, and in most cases better than, those of existing approaches.
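
The voting modifications described above replace the uniform majority vote of standard Random Forests with a weighted combination of the trees' outputs. As a minimal illustrative sketch (not the authors' exact method), the following Python snippet weights each tree's probability estimate by its out-of-bag accuracy; the ensemble size, the iris dataset, and the OOB-accuracy weighting scheme are all assumptions introduced solely for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

n_trees = 50  # hypothetical ensemble size for illustration
n_samples = X_train.shape[0]
trees, weights = [], []

for _ in range(n_trees):
    # Bootstrap sample; the left-out (out-of-bag) rows estimate tree strength.
    idx = rng.integers(0, n_samples, n_samples)
    oob = np.setdiff1d(np.arange(n_samples), idx)
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)
    tree.fit(X_train[idx], y_train[idx])
    # Weight each tree by its OOB accuracy (one plausible weighting scheme).
    acc = tree.score(X_train[oob], y_train[oob]) if len(oob) else 0.5
    trees.append(tree)
    weights.append(acc)

# Weighted voting: sum each tree's class-probability output scaled by its
# weight (assumes every bootstrap contained all classes, true for iris here).
proba = sum(w * t.predict_proba(X_test) for w, t in zip(weights, trees))
y_pred = proba.argmax(axis=1)
print("weighted-vote accuracy:", (y_pred == y_test).mean())
```

The same skeleton accommodates the other proposed voting variants: the per-tree weights could instead come from feature-selection scores, cluster membership of the test instance, or nearest-neighbor-based local accuracy, and could be tuned by an optimization procedure.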
