DNN or k-NN: That is the Generalize vs. Memorize Question

This paper studies the relationship between the classification decisions of deep neural networks (DNNs) and those of several classical classifiers, namely k-nearest neighbours (k-NN), support vector machines (SVM), and logistic regression (LR), applied at various layers of the network. This comparison provides new insights into the ability of neural networks to memorize the training data and generalize to new data at the same time, with k-NN serving as the ideal estimator that perfectly memorizes the data. We show that in non-generalizing networks, memorization happens only at the last layers. Moreover, the behavior of a DNN relative to the linear classifiers SVM and LR is essentially the same on the training and test data, regardless of whether the network generalizes. In contrast, the similarity to k-NN holds only in the absence of overfitting. Our results suggest that k-NN-like behavior of the network on new data is a sign of generalization. They also show that memorization and generalization, which are traditionally considered to contradict each other, are in fact compatible and complementary.
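
The comparison described above can be illustrated with a minimal sketch (not the authors' code): train a small network, fit k-NN, a linear SVM, and logistic regression on the activations of each hidden layer, and measure how often each classical classifier agrees with the network's own predictions on held-out data. The architecture, synthetic dataset, and hyperparameters below are illustrative assumptions, not the paper's experimental setup.

```python
# Hedged sketch of the layer-wise DNN vs. classical-classifier comparison.
import numpy as np
import torch
import torch.nn as nn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

# Synthetic data stands in for a real benchmark such as an image dataset.
X, y = make_classification(n_samples=2000, n_features=50, n_informative=20,
                           n_classes=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
X_tr_t = torch.tensor(X_tr, dtype=torch.float32)
X_te_t = torch.tensor(X_te, dtype=torch.float32)
y_tr_t = torch.tensor(y_tr, dtype=torch.long)

# A small MLP; the layer sizes are arbitrary for illustration.
model = nn.Sequential(
    nn.Linear(50, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 4),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
for _ in range(200):
    opt.zero_grad()
    loss_fn(model(X_tr_t), y_tr_t).backward()
    opt.step()

def activations(x, upto):
    """Forward pass through the first `upto` modules of the Sequential model."""
    with torch.no_grad():
        for layer in list(model)[:upto]:
            x = layer(x)
    return x.numpy()

# The network's own test-set predictions, against which agreement is measured.
with torch.no_grad():
    dnn_pred = model(X_te_t).argmax(dim=1).numpy()

# Fit each classical classifier on a layer's features and report agreement.
for upto, name in [(2, "after layer 1"), (4, "after layer 2")]:
    F_tr, F_te = activations(X_tr_t, upto), activations(X_te_t, upto)
    for clf_name, clf in [("k-NN", KNeighborsClassifier(n_neighbors=5)),
                          ("SVM", LinearSVC()),
                          ("LR", LogisticRegression(max_iter=1000))]:
        clf.fit(F_tr, y_tr)
        agreement = np.mean(clf.predict(F_te) == dnn_pred)
        print(f"{name:>14} | {clf_name:>4} agreement with DNN: {agreement:.3f}")
```

Under the paper's reading, high k-NN agreement on held-out data would indicate a generalizing representation, while agreement with the linear classifiers is expected at every layer regardless of generalization.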
