The Connection between DNNs and Classic Classifiers: Generalize, Memorize, or Both?

This work studies the relationship between the classification performed by deep neural networks (DNNs) and the decisions of various classic classifiers, namely $k$-nearest neighbors ($k$-NN), support vector machines (SVM), and logistic regression (LR). The comparison is carried out at various layers of the network, providing new insights into the ability of DNNs to simultaneously memorize the training data and generalize to new data, where $k$-NN serves as the ideal estimator that perfectly memorizes the data. First, we show that DNNs' generalization improves gradually along their layers and that memorization in non-generalizing networks happens only at the last layers. We also observe that the behavior of a DNN relative to the linear classifiers SVM and LR is largely the same on the training and test data, regardless of whether the network generalizes. In contrast, the similarity to $k$-NN holds only in the absence of overfitting. This suggests that $k$-NN-like behavior of the network on new data is a good indicator of generalization. Moreover, it allows existing $k$-NN theory to be applied to DNNs.
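To make the layer-wise comparison concrete, below is a minimal sketch, not the paper's actual implementation, of how one could measure how often $k$-NN, SVM, and LR trained on a DNN's features at a given layer agree with the network's own predictions. The helpers `extract_features` and `dnn_predict` in the usage comment are hypothetical placeholders for whatever feature-extraction and inference routines a given framework provides.

```python
# Sketch (assumed, not taken from the paper): fit k-NN, SVM, and logistic
# regression on the features a trained DNN produces at one layer, then measure
# how often each classic classifier agrees with the DNN's predicted labels.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression


def layer_agreement(feats_train, y_train, feats_test, dnn_preds_test, k=5):
    """Fraction of test points on which each classic classifier, trained on the
    DNN's layer features, predicts the same label as the DNN itself."""
    classifiers = {
        "kNN": KNeighborsClassifier(n_neighbors=k),
        "SVM": LinearSVC(),
        "LR": LogisticRegression(max_iter=1000),
    }
    agreement = {}
    for name, clf in classifiers.items():
        clf.fit(feats_train, y_train)          # train on layer features
        preds = clf.predict(feats_test)        # classic classifier's decisions
        agreement[name] = float(np.mean(preds == dnn_preds_test))
    return agreement


# Hypothetical usage, repeated for every layer of interest:
# for layer in model_layers:
#     f_tr = extract_features(model, layer, X_train)
#     f_te = extract_features(model, layer, X_test)
#     print(layer, layer_agreement(f_tr, y_train, f_te, dnn_predict(model, X_test)))
```

Tracking these agreement scores across layers is one way to observe the gradual improvement in generalization and the late-layer memorization described above.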
