A novel feature subspace selection method in random forests for high-dimensional data

Random forests are a class of ensemble methods for classification and regression whose randomizing mechanisms are bagging of training instances and selection of feature subspaces. For high-dimensional data, the performance of random forests degrades because the feature subspace sampled uniformly at random for each node in the construction of a decision tree often contains few informative features. To address this issue, we propose in this paper a new method for feature subspace selection in random forests on high-dimensional data, based on Principal Component Analysis and stratified sampling and called PCA-SS. For each decision tree in the forest, we first create the training data by bagging instances and partition the feature set into several feature subsets. Principal Component Analysis (PCA) is applied to each feature subset to obtain transformed features, and all principal components are retained in order to preserve the variability information of the data. Second, the transformed features are partitioned into an informative part and a less informative part according to how much variance their principal components explain. When constructing each node of a decision tree, a feature subspace is selected from the two parts by stratified sampling. The PCA-SS based random forests algorithm, named PSRF, ensures enough informative features for each tree node, and it also increases the diversity among the trees to a certain extent. Experimental results demonstrate that the proposed PSRF significantly improves the performance of random forests on high-dimensional data compared with state-of-the-art random forests algorithms.
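
Since the abstract walks through PCA-SS step by step, a minimal sketch may help make the procedure concrete. The Python code below assumes scikit-learn's PCA and uses a cumulative explained-variance threshold as the criterion separating informative from less informative transformed features; the function name pca_ss_subspace and the parameters n_subsets, var_threshold, subspace_size, and informative_frac are illustrative assumptions, not taken from the paper.

```python
import numpy as np
from sklearn.decomposition import PCA


def pca_ss_subspace(X, n_subsets=5, var_threshold=0.9,
                    subspace_size=10, informative_frac=0.7, rng=None):
    """Sketch of PCA-SS feature subspace selection for one tree.

    Returns the PCA-transformed feature matrix together with the
    column indices of one stratified-sampled feature subspace.
    """
    rng = np.random.default_rng(rng)
    n_features = X.shape[1]

    # Step 1: randomly partition the original features into disjoint subsets.
    subsets = np.array_split(rng.permutation(n_features), n_subsets)

    transformed, informative, less_informative = [], [], []
    offset = 0
    for idx in subsets:
        # Step 2: PCA on this feature subset; PCA() retains every
        # component (up to the rank of the subset matrix).
        pca = PCA()
        Z = pca.fit_transform(X[:, idx])
        transformed.append(Z)

        # Step 3: components that together explain var_threshold of the
        # subset's variance form the informative stratum (assumed criterion).
        cum = np.cumsum(pca.explained_variance_ratio_)
        k = min(int(np.searchsorted(cum, var_threshold)) + 1, Z.shape[1])
        informative.extend(range(offset, offset + k))
        less_informative.extend(range(offset + k, offset + Z.shape[1]))
        offset += Z.shape[1]
    Z_all = np.hstack(transformed)

    # Step 4: stratified sampling -- draw a fixed share of the subspace
    # from each stratum.
    n_inf = min(len(informative), int(round(subspace_size * informative_frac)))
    n_less = min(len(less_informative), subspace_size - n_inf)
    cols = list(rng.choice(informative, size=n_inf, replace=False))
    if n_less > 0:
        cols += list(rng.choice(less_informative, size=n_less, replace=False))
    return Z_all, np.asarray(cols, dtype=int)


# Example: transform once, then use the sampled columns at one node.
X = np.random.default_rng(0).normal(size=(60, 200))  # 60 samples, 200 features
Z, cols = pca_ss_subspace(X, rng=1)
node_features = Z[:, cols]  # candidate split features for one tree node
```

In the pipeline the abstract describes, steps 1 to 3 would run once per tree on its bagged training data, while the stratified sampling of step 4 is repeated at every node; the sketch combines them in one function for brevity.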
