DeepNNK: Explaining deep models and their generalization using polytope interpolation

Modern machine learning systems based on neural networks have shown great success in learning complex data patterns and making good predictions on unseen data. However, the limited interpretability of these systems hinders further progress and their application to several real-world domains. This predicament is exemplified by time-consuming model selection and the difficulties faced in predictive explainability, especially in the presence of adversarial examples. In this paper, we take a step towards a better understanding of neural networks by introducing a local polytope interpolation method. The proposed Deep Non-Negative Kernel regression (NNK) interpolation framework is non-parametric, theoretically simple, and geometrically intuitive. We demonstrate instance-based explainability for deep learning models and develop a method to identify models with good generalization properties using leave-one-out estimation. Finally, we provide a rationale for adversarial and generative examples, which are inevitable from an interpolation view of machine learning.
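To make the interpolation idea concrete, the sketch below shows one plausible way to compute NNK interpolation weights and a leave-one-out error estimate on embeddings extracted from a trained network. It is a minimal illustration only, assuming a Gaussian kernel, binary labels, and the standard reduction of the non-negative kernel objective to non-negative least squares; the function names (nnk_weights, loo_error) and parameters (sigma, k) are illustrative choices, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of NNK interpolation with a
# leave-one-out (LOO) error estimate on network embeddings, assuming a
# Gaussian kernel and binary labels.
import numpy as np
from scipy.optimize import nnls
from scipy.linalg import cholesky, solve_triangular

def gaussian_kernel(A, B, sigma=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

def nnk_weights(x, neighbors, sigma=1.0, reg=1e-8):
    """Solve min_{theta >= 0} 0.5*theta'K theta - k_x'theta over the selected
    neighbors, rewritten as non-negative least squares via a Cholesky
    factorization of the neighbor kernel matrix."""
    K = gaussian_kernel(neighbors, neighbors, sigma) + reg * np.eye(len(neighbors))
    k_x = gaussian_kernel(neighbors, x[None, :], sigma).ravel()
    C = cholesky(K, lower=True)              # K = C C^T
    b = solve_triangular(C, k_x, lower=True)  # C b = k_x
    theta, _ = nnls(C.T, b)                   # min ||C^T theta - b||^2, theta >= 0
    return theta

def loo_error(X, y, k=50, sigma=1.0):
    """Leave-one-out NNK classification error: each point is interpolated
    from the labels of its remaining k nearest neighbors."""
    errors = 0
    for i in range(len(X)):
        others = np.delete(np.arange(len(X)), i)
        d = np.linalg.norm(X[others] - X[i], axis=1)
        nbr = others[np.argsort(d)[:k]]       # k nearest neighbors, excluding i
        theta = nnk_weights(X[i], X[nbr], sigma)
        if theta.sum() == 0:                  # isolated point: count as an error
            errors += 1
            continue
        y_hat = (theta @ y[nbr]) / theta.sum()  # weighted label interpolation
        errors += int(round(y_hat) != y[i])
    return errors / len(X)
```

In practice, X would hold the penultimate-layer activations of the model on the training set, so the LOO error above serves as a proxy for how well the learned representation generalizes.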
