Auto-CES: An Automatic Pruning Method Through Clustering Ensemble Selection

Ensemble learning is a machine learning approach in which multiple learners are trained to solve the same problem. Random Forest is an ensemble learning algorithm that combines numerous decision trees, selecting a class by majority voting for classification and by averaging for regression. Prior research shows that the training time of Random Forest grows linearly with the number of trees in the forest. A large number of decision trees poses two challenges: it increases model complexity, and it degrades efficiency on large-scale datasets. Ensemble pruning methods, such as Clustering Ensemble Selection (CES), are therefore devised to select a subset of decision trees from the forest. The main limitation of existing CES models is that they require the number of clusters as input. To address this, we devise an automatic CES pruning model (Auto-CES) for Random Forest that finds the appropriate number of clusters on its own. The resulting subset of trees achieves the same or even better effectiveness than the original forest. Auto-CES has two components: clustering and selection. The clustering step uses a new clustering technique to group homogeneous trees; the selection step then considers both the accuracy and the diversity of the trees to choose the best tree from each cluster.

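As a concrete illustration of the two components, the sketch below prunes a fitted scikit-learn RandomForestClassifier: it represents each tree by its prediction vector on a validation set, chooses the number of clusters automatically, and keeps the most accurate tree from each cluster. This is an assumption-laden stand-in, not the authors' implementation: the clustering method (k-means with a silhouette-score search over k), the per-cluster winner rule, and the requirement of numeric class labels are all choices made here for illustration.

```python
import numpy as np
from copy import deepcopy
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, silhouette_score

def auto_ces_prune(forest, X_val, y_val, k_range=range(2, 11)):
    """Illustrative Auto-CES-style pruning of a fitted forest.

    Assumptions: the cluster count is picked by silhouette score, the
    per-cluster representative is the most accurate tree, and class
    labels are numeric; the paper's actual clustering technique and
    accuracy/diversity trade-off may differ.
    """
    # Describe each tree by its predicted labels on held-out data;
    # sub-estimators predict encoded class indices, so map them back.
    preds = np.array([
        forest.classes_[tree.predict(X_val).astype(int)]
        for tree in forest.estimators_
    ])

    # Find the number of clusters automatically: try each k and keep
    # the clustering with the highest silhouette score.
    best_labels, best_score = None, -1.0
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(preds)
        score = silhouette_score(preds, labels)
        if score > best_score:
            best_labels, best_score = labels, score

    # Selection: trees within a cluster behave similarly, so keep only
    # the most accurate one; diversity is preserved across clusters.
    kept = []
    for c in np.unique(best_labels):
        members = np.flatnonzero(best_labels == c)
        accs = [accuracy_score(y_val, preds[i]) for i in members]
        kept.append(members[int(np.argmax(accs))])

    # Build the pruned ensemble by copying the forest and replacing
    # its list of estimators with the selected trees.
    pruned = deepcopy(forest)
    pruned.estimators_ = [forest.estimators_[i] for i in kept]
    pruned.n_estimators = len(kept)
    return pruned
```

Under these assumptions the pruned model is a drop-in replacement for the original, e.g. `pruned = auto_ces_prune(rf, X_val, y_val)` followed by `pruned.predict(X_test)`.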