A Flexible Nadaraya-Watson Head Can Offer Explainable and Calibrated Classification

In this paper, we empirically analyze a simple, non-learnable, and nonparametric Nadaraya-Watson (NW) prediction head that can be used with any neural network architecture. In the NW head, the prediction is a weighted average of labels from a support set. The weights are computed from distances between the query feature and the support features. This is in contrast to the dominant approach of using a learnable classification head (e.g., a fully-connected layer) on the features, which can be challenging to interpret and can yield poorly calibrated predictions. Our empirical results on an array of computer vision tasks demonstrate that the NW head can yield better calibration than its parametric counterpart while achieving comparable accuracy, particularly in data-limited settings. To improve inference-time efficiency, we propose a simple approach in which a clustering step run on the training set produces a relatively small, distilled support set. Furthermore, we explore two means of interpretability/explainability that fall naturally out of the NW head. The first is the label weights, and the second is our novel concept of the "support influence function," an easy-to-compute metric that quantifies the influence of a support element on the prediction for a given query. As we demonstrate in our experiments, the influence function allows the user to debug a trained model. We believe that the NW head is a flexible, interpretable, and highly useful building block that can be used in a range of applications.
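To make the mechanism concrete, below is a minimal NumPy sketch of the two ideas in the abstract: an NW head that predicts a weighted average of one-hot support labels, and a leave-one-out notion of support influence. The squared Euclidean distance, the softmax kernel, the `temperature` parameter, and the leave-one-out form of `support_influence` are illustrative assumptions for this sketch, not necessarily the paper's exact definitions.

```python
import numpy as np

def nw_head(query_feat, support_feats, support_labels, num_classes, temperature=1.0):
    """Nadaraya-Watson prediction head (sketch).

    query_feat:     (d,)   feature vector of the query
    support_feats:  (n, d) feature vectors of the support set
    support_labels: (n,)   integer class labels of the support set
    Returns class probabilities of shape (num_classes,).
    """
    onehot = np.eye(num_classes)[support_labels]               # (n, c)
    # Assumed kernel: softmax over negative squared Euclidean distances.
    dists = np.sum((support_feats - query_feat) ** 2, axis=1)  # (n,)
    logits = -dists / temperature
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                                   # nonnegative, sums to 1
    return weights @ onehot                                    # weighted label average

def support_influence(query_feat, support_feats, support_labels, num_classes, i):
    """Hypothetical leave-one-out influence of support element i on a query.

    Measures how much the prediction moves when element i is removed; the
    paper's exact "support influence function" may be defined differently.
    """
    full = nw_head(query_feat, support_feats, support_labels, num_classes)
    keep = np.arange(len(support_labels)) != i
    loo = nw_head(query_feat, support_feats[keep], support_labels[keep], num_classes)
    return np.linalg.norm(full - loo)  # larger value => element i was more influential

# Usage with random stand-in features:
rng = np.random.default_rng(0)
feats = rng.normal(size=(50, 16))      # 50 support features of dimension 16
labels = rng.integers(0, 5, size=50)   # 5 classes
probs = nw_head(feats[0], feats, labels, num_classes=5)
```

In this sketch, `support_feats` could be features of the full training set or, as the abstract suggests, a smaller distilled support set obtained by clustering the training features.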
