Optimal Piecewise Approximations for Model Interpretation

Recent literature interprets the predictions of "black-box" machine learning models (neural networks, random forests, etc.) by approximating them with simpler models, such as piecewise linear or piecewise constant ones. This literature, however, provides no guarantees on how well these approximations reflect the behavior of the predictive model, which can result in misleading interpretations. We provide a tractable dynamic programming algorithm that partitions the feature space into subsets and assigns a local constant or linear model to each subset, yielding a piecewise constant or piecewise linear interpretation of an arbitrary predictive model. When the approximation loss (between the interpretation and the predictive model) is measured in terms of mean squared error, our approximation is optimal; for more general loss functions, it is approximately optimal. Therefore, in both cases the interpretation probably approximately correctly (PAC) learns the predictive model. Experiments with real and synthetic data show that it provides significant improvements, in terms of mean squared error, over competing approaches. We also present real use cases that establish the utility of the proposed approach.
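
To make the core idea concrete, the sketch below shows an MSE-optimal piecewise-constant approximation of a model's predictions along a single feature, computed by dynamic programming. This is an illustration of the general technique under simplifying assumptions (one dimension, a fixed segment budget), not the paper's algorithm; the names `fit_piecewise_constant` and `num_pieces` are our own illustrative choices.

```python
import numpy as np

def fit_piecewise_constant(y, num_pieces):
    """MSE-optimal piecewise-constant fit of a 1-D sequence y with
    `num_pieces` segments, via dynamic programming. Prefix sums give
    O(1) segment costs, so the total runtime is O(n^2 * num_pieces)."""
    n = len(y)
    assert 1 <= num_pieces <= n
    # Prefix sums of y and y^2: the SSE of fitting the mean to any
    # segment [i, j] follows in constant time.
    ps = np.concatenate([[0.0], np.cumsum(y)])
    ps2 = np.concatenate([[0.0], np.cumsum(np.square(y))])

    def seg_cost(i, j):  # SSE of the best constant on y[i..j] (inclusive)
        s, s2, m = ps[j + 1] - ps[i], ps2[j + 1] - ps2[i], j - i + 1
        return s2 - s * s / m

    INF = float("inf")
    # dp[k][j] = minimal SSE covering y[0..j] with exactly k segments.
    dp = np.full((num_pieces + 1, n), INF)
    cut = np.zeros((num_pieces + 1, n), dtype=int)  # backpointers
    for j in range(n):
        dp[1][j] = seg_cost(0, j)
    for k in range(2, num_pieces + 1):
        for j in range(k - 1, n):
            for i in range(k - 1, j + 1):  # last segment is y[i..j]
                c = dp[k - 1][i - 1] + seg_cost(i, j)
                if c < dp[k][j]:
                    dp[k][j], cut[k][j] = c, i
    # Trace back the optimal segment boundaries, then fill in means.
    k, j, bounds = num_pieces, n - 1, []
    while k >= 1:
        i = 0 if k == 1 else cut[k][j]
        bounds.append((i, j))
        j, k = i - 1, k - 1
    fit = np.empty(n)
    for i, j in bounds:
        fit[i : j + 1] = y[i : j + 1].mean()
    return fit

# Usage: approximate a black-box model's 1-D predictions with 3 pieces.
x = np.linspace(0, 1, 50)
preds = np.sin(3 * x) + 0.1 * (x > 0.5)  # stand-in for black-box output f(x)
approx = fit_piecewise_constant(preds, num_pieces=3)
print("MSE:", np.mean((preds - approx) ** 2))
```

Because every possible placement of the last segment boundary is examined and segment costs are exact, the returned fit attains the minimum possible MSE among all piecewise-constant functions with the given segment budget, mirroring the optimality claim above for the one-dimensional constant case.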