An experimental study of the intrinsic stability of random forest variable importance measures

Background: The stability of Variable Importance Measures (VIMs) based on random forest has recently received increased attention. Although traditional stability under data perturbations or parameter variations has been studied extensively, few studies consider the influence of the intrinsic randomness in generating VIMs, i.e., bagging, randomization and permutation. To address this influence, we introduce a new concept, the intrinsic stability of VIMs, defined as the self-consistency among feature rankings in repeated runs of VIMs without data perturbations or parameter variations. Two widely used VIMs, Mean Decrease Accuracy (MDA) and Mean Decrease Gini (MDG), are comprehensively investigated. The motivation of this study is two-fold. First, we empirically verify the prevalence of intrinsic stability of VIMs over many real-world datasets, to highlight that the instability of VIMs does not originate exclusively from data perturbations or parameter variations but also stems from the intrinsic randomness of VIMs. Second, through Spearman and Pearson tests, we comprehensively investigate how different factors influence the intrinsic stability.

Results: The experiments are carried out on 19 benchmark datasets with diverse characteristics, including 10 high-dimensional, small-sample gene expression datasets. The experimental results demonstrate the prevalence of intrinsic stability of VIMs. Spearman and Pearson tests on the correlations between intrinsic stability and different factors show that #feature (number of features) and #sample (sample size) have a coupling effect on the intrinsic stability. The synthetic indicator, #feature/#sample, shows both a negative monotonic correlation and a negative linear correlation with the intrinsic stability, while out-of-bag (OOB) accuracy has a monotonic correlation with intrinsic stability. This indicates that high-dimensional, small-sample and high-complexity datasets may suffer more from intrinsic instability of VIMs. Furthermore, with respect to the parameter settings of random forest, a large number of trees is preferred. No significant correlations are observed between intrinsic stability and the other factors. Finally, the magnitude of intrinsic stability is always smaller than that of traditional stability.

Conclusion: First, the prevalence of intrinsic stability of VIMs demonstrates that the instability of VIMs comes not only from data perturbations or parameter variations but also from the intrinsic randomness of VIMs. This finding gives a better understanding of VIM stability and may help reduce the instability of VIMs. Second, by investigating the potential factors of intrinsic stability, users can become more aware of the risks and hence more careful when using VIMs, especially on high-dimensional, small-sample and high-complexity datasets.
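
The notion of intrinsic stability as self-consistency among feature rankings in repeated runs can be illustrated with a small sketch. The abstract does not state which consistency score the study uses, so the version below assumes the mean pairwise Spearman rank correlation between importance vectors as a stand-in. The function names (`average_ranks`, `intrinsic_stability`) and the toy importance scores are hypothetical and are not taken from the paper.

```python
from itertools import combinations
from statistics import mean


def average_ranks(scores):
    """Rank scores in descending order (1 = most important), averaging ties."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next score ties with the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions in the block
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson linear correlation (undefined if either input is constant)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def intrinsic_stability(runs):
    """Assumed stability score: mean pairwise Spearman correlation over runs.

    Each element of `runs` is the importance vector from one repeated run on
    the same data with the same parameters; only the random seed differs.
    """
    return mean(spearman(a, b) for a, b in combinations(runs, 2))


# Made-up importance scores for 4 features from 3 repeated runs.
runs = [
    [0.40, 0.30, 0.20, 0.10],
    [0.38, 0.31, 0.12, 0.19],  # the two weakest features swap ranks
    [0.41, 0.29, 0.21, 0.09],
]
print(round(intrinsic_stability(runs), 3))  # 0.867
```

In practice the rows of `runs` would come from refitting a random forest several times on identical data and parameters and recording MDA or MDG scores each time. A score of 1 would mean every run produced the same ranking.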
