Enhancing Simple Models by Exploiting What They Already Know

There has been recent interest in improving the performance of simple models for several reasons, including interpretability, robust learning from small data, deployment in memory-constrained settings, and environmental considerations. In this paper, we propose a novel method, SRatio, that uses information from high-performing complex models (e.g., deep neural networks, boosted trees, random forests) to reweight a training dataset for a potentially low-performing simple model of much lower complexity, such as a decision tree or a shallow network, thereby enhancing its performance. Unlike prior works, which primarily consider the complex model's confidences or predictions, our method also leverages the simple model's per-sample hardness estimates, and is thus conceptually novel. Moreover, we generalize and formalize the concept of attaching probes to intermediate layers of a neural network to other commonly used classifiers and incorporate this into our method. The benefit of these contributions is seen in the experiments, where on six UCI datasets and CIFAR-10 we outperform competitors in a majority (16 out of 27) of the cases and tie for best performance in the remaining cases; in a couple of cases, we even approach the complex model's performance. We also conduct further experiments to validate our assertions and to understand intuitively why our method works. Theoretically, we motivate our approach by showing that the weighted loss minimized by the simple model under our weighting upper bounds the loss of the complex model.
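A minimal sketch of the sample-reweighting idea described above, assuming (as the name SRatio suggests) that each training example is weighted by the ratio of the complex model's confidence in the true label to the simple model's confidence; the exact formulation, the probe-based confidences, and the model choices in the paper may differ, and all names below are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier

def sratio_weights(complex_model, simple_model, X, y, eps=1e-6):
    """Per-sample weights: the complex model's true-class confidence divided by
    the (pre-reweighting) simple model's true-class confidence. Hard examples
    for the simple model that the complex model handles well get larger weight."""
    p_complex = complex_model.predict_proba(X)[np.arange(len(y)), y]
    p_simple = simple_model.predict_proba(X)[np.arange(len(y)), y]
    return p_complex / np.maximum(p_simple, eps)

# Illustrative workflow (an assumption, not the paper's exact recipe):
# 1) fit the complex model, 2) fit the simple model once to obtain its own
#    hardness estimates, 3) refit the simple model with the SRatio weights.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

complex_model = GradientBoostingClassifier().fit(X, y)
simple_model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

w = sratio_weights(complex_model, simple_model, X, y)
simple_reweighted = DecisionTreeClassifier(max_depth=3, random_state=0).fit(
    X, y, sample_weight=w)
```

The key design point, per the abstract, is that the weights depend on both models: the denominator injects the simple model's per-sample hardness estimate rather than relying on the complex model's outputs alone.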
