Please Stop Permuting Features: An Explanation and Alternatives

This paper advocates against permute-and-predict (PaP) methods for interpreting black box functions. Methods such as the variable importance measures proposed for random forests, partial dependence plots, and individual conditional expectation plots remain popular because of their ability to provide model-agnostic measures that depend only on the pre-trained model output. However, numerous studies have found that these tools can produce diagnostics that are highly misleading, particularly when there is strong dependence among features. Rather than simply add to this growing literature by further demonstrating such issues, here we seek to provide an explanation for the observed behavior. In particular, we argue that breaking dependencies between features in hold-out data places undue emphasis on sparse regions of the feature space by forcing the original model to extrapolate to regions where there is little to no data. We explore these effects through various settings where a ground truth is understood and find support for previous claims in the literature that PaP metrics tend to over-emphasize correlated features in both variable importance measures and partial dependence plots, even though applying permutation methods to the ground-truth models does not. As an alternative, we recommend more direct approaches that have proven successful in other settings: explicitly removing features, conditional permutations, or model distillation methods.
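The extrapolation problem described above, and one of the recommended alternatives (explicitly removing features and refitting), can be illustrated with a minimal sketch. The toy data, feature names, and helper functions below are illustrative assumptions, not the paper's experiments; the sketch assumes scikit-learn and NumPy. With two strongly correlated features, permuting one of them in hold-out data forces the model to predict at implausible feature combinations, inflating its apparent importance, whereas refitting without the feature respects the data distribution.

```python
# A minimal sketch (not the paper's code) contrasting permute-and-predict (PaP)
# importance with a remove-and-retrain alternative on hypothetical toy data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Two strongly correlated features (x1, x2) plus one independent feature (x3).
n = 2000
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)          # nearly a copy of x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = x1 + x3 + 0.1 * rng.normal(size=n)      # only x1 and x3 drive the response

X_tr, X_te, y_tr, y_te = X[:1000], X[1000:], y[:1000], y[1000:]
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
base_err = mean_squared_error(y_te, model.predict(X_te))

def pap_importance(model, X, y, j, rng):
    """Permute column j in hold-out data; importance = increase in error.
    Breaking the x1-x2 dependence forces the model to extrapolate."""
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])
    return mean_squared_error(y, model.predict(Xp)) - base_err

def remove_and_retrain_importance(X_tr, y_tr, X_te, y_te, j):
    """Refit without feature j; importance = loss in hold-out accuracy.
    No extrapolation is needed because the joint distribution is respected."""
    keep = [k for k in range(X_tr.shape[1]) if k != j]
    m = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr[:, keep], y_tr)
    return mean_squared_error(y_te, m.predict(X_te[:, keep])) - base_err

for j in range(3):
    print(f"x{j+1}: PaP = {pap_importance(model, X_te, y_te, j, rng):.3f}, "
          f"remove/retrain = {remove_and_retrain_importance(X_tr, y_tr, X_te, y_te, j):.3f}")
```

Under this setup one would typically see the PaP scores for the correlated pair exceed their remove-and-retrain counterparts, since dropping x1 (or x2) and refitting lets the nearly redundant partner compensate, while permuting either one places the model in sparse regions of the feature space.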
