Unbiased Feature Selection in Learning Random Forests for High-Dimensional Data

Random forests (RFs) have been widely used as a powerful classification method. However, because both the bagging samples and the candidate features are chosen at random, the trees in a forest tend to select uninformative features for node splitting, which gives RFs poor accuracy on high-dimensional data. RFs are also biased in feature selection: multivalued features are favored. To debias feature selection in RFs, we propose a new RF algorithm, called xRF, that selects good features when learning RFs from high-dimensional data. We first remove the uninformative features using p-value assessment, and then select a subset of unbiased features based on statistical measures. This feature subset is partitioned into two subsets, and a feature weighting sampling technique is used to sample features from these two subsets for building trees. This approach generates more accurate trees while reducing the dimensionality and the amount of data needed for learning RFs. An extensive set of experiments was conducted on 47 high-dimensional real-world datasets, including image datasets. The experimental results show that RFs built with the proposed approach outperformed existing random forests on both accuracy and AUC.
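
The three steps named in the abstract (p-value filtering, partitioning into two subsets, and weighted sampling from both) can be illustrated with a minimal Python sketch. It is not the xRF implementation. The abstract does not say how the p-values or feature scores are computed, where the partition threshold lies, or how many features are drawn from each subset, so the sketch takes per-feature `p_values` and positive `scores` as given inputs and makes three labeled assumptions: a cutoff `alpha = 0.05`, a partition at the mean score, and a per-subset quota proportional to subset size. All function names are hypothetical.

```python
import random


def filter_informative(p_values, alpha=0.05):
    """Keep the indices of features whose p-value is below alpha (assumed cutoff)."""
    return [j for j, p in enumerate(p_values) if p < alpha]


def partition_by_score(kept, scores):
    """Split kept features into two subsets at the mean score (an assumed threshold)."""
    if not kept:
        return [], []
    threshold = sum(scores[j] for j in kept) / len(kept)
    strong = [j for j in kept if scores[j] >= threshold]
    weak = [j for j in kept if scores[j] < threshold]
    return strong, weak


def weighted_sample(items, scores, k, rng):
    """Draw up to k distinct items, each pick proportional to its (positive) score."""
    pool = list(items)
    weights = [scores[j] for j in pool]
    chosen = []
    for _ in range(min(k, len(pool))):
        i = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        chosen.append(pool.pop(i))
        weights.pop(i)
    return chosen


def sample_subspace(strong, weak, scores, mtry, rng):
    """Draw a feature subspace of size up to mtry from both subsets.

    The quota per subset is proportional to subset size, with at least one
    feature from the high-score subset (an assumption of this sketch).
    """
    total = len(strong) + len(weak)
    if total == 0:
        return []
    n_strong = min(len(strong), max(1, round(mtry * len(strong) / total)))
    n_weak = min(len(weak), mtry - n_strong)
    return (weighted_sample(strong, scores, n_strong, rng)
            + weighted_sample(weak, scores, n_weak, rng))


# Toy inputs: six features with made-up p-values and positive scores.
p_values = [0.001, 0.20, 0.03, 0.60, 0.004, 0.01]
scores = [0.9, 0.1, 0.3, 0.05, 0.7, 0.2]

kept = filter_informative(p_values)
strong, weak = partition_by_score(kept, scores)
subspace = sample_subspace(strong, weak, scores, mtry=3, rng=random.Random(0))
print(kept, strong, weak, subspace)
```

In a full forest, one such subspace would be drawn at each node (or for each tree) before searching for the best split, so that features removed by the filter never compete for a split.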
