Rule Extraction from Random Forest: the RF+HC Methods

Random forest (RF) is a tree-based learning method that generalizes well on real data sets. A limitation of RF, however, is that it produces a forest of many trees and rules, so it is generally viewed as a black-box model. This paper proposes the RF+HC methods for rule extraction from RF. Once the RF is built, a hill climbing algorithm searches for a much smaller rule set, which dramatically reduces the number of rules and significantly improves the comprehensibility of the underlying model. The proposed methods are evaluated on eighteen UCI and four microarray data sets. The experimental results show that they outperform one of the state-of-the-art methods in scalability and comprehensibility while preserving the same level of accuracy.
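The search step described above can be illustrated with a minimal sketch. It is not the paper's exact procedure: the objective (accuracy minus a size penalty), the majority-vote prediction, the default class, and the choice to start from the full rule set are all assumptions made for illustration, and the names `fires`, `predict`, `score`, and `hill_climb` are hypothetical. In practice the candidate rules would be the root-to-leaf paths of the trees in the forest; here they are written by hand.

```python
from collections import Counter

# A rule is (conditions, label); each condition is (feature_index, op, threshold)
# with op either "<=" or ">". This representation is an assumption of the sketch.

def fires(rule, x):
    """Return True if sample x satisfies every condition of the rule."""
    conditions, _ = rule
    return all((x[f] <= t) if op == "<=" else (x[f] > t) for f, op, t in conditions)

def predict(rules, mask, x, default):
    """Majority vote over the selected rules that fire; fall back to a default class."""
    votes = Counter(r[1] for r, keep in zip(rules, mask) if keep and fires(r, x))
    return votes.most_common(1)[0][0] if votes else default

def score(rules, mask, X, y, default, penalty=0.1):
    """Assumed objective: accuracy minus a penalty on the fraction of rules kept."""
    acc = sum(predict(rules, mask, x, default) == t for x, t in zip(X, y)) / len(y)
    return acc - penalty * sum(mask) / len(mask)

def hill_climb(rules, X, y, default, penalty=0.1):
    """Start from the full rule set and flip one rule in or out at a time,
    keeping a flip only if it strictly improves the score."""
    mask = [True] * len(rules)
    best = score(rules, mask, X, y, default, penalty)
    improved = True
    while improved:
        improved = False
        for i in range(len(mask)):
            mask[i] = not mask[i]          # tentatively flip rule i
            s = score(rules, mask, X, y, default, penalty)
            if s > best:
                best, improved = s, True   # keep the flip
            else:
                mask[i] = not mask[i]      # undo the flip
    return mask, best

# Toy usage: four hand-written rules on a single feature.
X = [[0.1], [0.2], [0.8], [0.9]]
y = [0, 0, 1, 1]
rules = [
    ([(0, "<=", 0.5)], 0),
    ([(0, ">", 0.5)], 1),
    ([(0, "<=", 0.5)], 0),   # duplicate of the first rule
    ([(0, ">", 0.15)], 0),   # noisy rule
]
mask, best = hill_climb(rules, X, y, default=0)
print(mask, best)
```

Because a flip is accepted only when the score strictly increases, the loop terminates at a local optimum of the assumed objective; a different penalty weight or starting point may lead to a different rule subset.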
