Rank Forest: Systematic Attribute Sub-spacing in Decision Forest

Decision trees are well-known classification algorithms that are also appreciated for their capacity for knowledge discovery. Two major shortcomings of decision trees have been pointed out in the literature: (1) instability and (2) high computational cost. These problems have been addressed to some extent through ensemble learning techniques such as Random Forest. Unlike a decision tree, where the whole attribute space of a dataset is searched to find the best test attribute for a node, Random Forest first selects a random subspace of attributes from which the test attribute for a node is then identified. However, this random selection can fill a subspace with many (or even all) poor-quality attributes, resulting in an individual tree with low accuracy. Therefore, in this paper we propose a probabilistic selection of attributes (instead of a random selection), where the probability of selecting an attribute is proportional to its quality. Although we developed this approach independently, after the research was completed we discovered that some existing techniques take the same approach. While this paper uses mutual information as the measure of attribute quality, the existing techniques used information gain ratio and a t-test instead. The proposed technique is evaluated on nine different datasets and shows stable performance in terms of accuracy (both ensemble accuracy and individual tree accuracy) and efficiency.
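Since the abstract only sketches the idea, here is a minimal illustration of probability-proportional-to-quality subspace selection, assuming scikit-learn's mutual_info_classif as the quality measure; the function name rank_subspace, its parameters, and the fallback for all-zero scores are illustrative assumptions, not details taken from the paper:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def rank_subspace(X, y, subspace_size, seed=None):
    """Pick `subspace_size` attribute indices, each with probability
    proportional to its mutual information with the class label.
    In Random Forest the same draw would be uniform over attributes."""
    rng = np.random.default_rng(seed)
    mi = mutual_info_classif(X, y)           # attribute quality scores
    if mi.sum() == 0.0:                      # fall back to uniform sampling
        probs = np.full(mi.size, 1.0 / mi.size)
    else:
        probs = mi / mi.sum()                # normalise to a distribution
    # Weighted sampling without replacement: higher-quality attributes
    # are more likely to enter the subspace, but none is excluded outright.
    return rng.choice(mi.size, size=subspace_size, replace=False, p=probs)

# Example: draw a 3-attribute subspace for one node of one tree.
if __name__ == "__main__":
    from sklearn.datasets import load_iris
    X, y = load_iris(return_X_y=True)
    print(rank_subspace(X, y, subspace_size=3, seed=42))
```

In an ensemble built this way, the draw would be repeated at every node (or once per tree), so high-quality attributes appear in subspaces more often while weaker attributes still retain a non-zero chance of selection, preserving diversity among the trees.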
