Mad Max: Affine Spline Insights Into Deep Learning

We build a rigorous bridge between deep networks (DNs) and approximation theory via spline functions and operators. Our key result is that a large class of DNs can be written as a composition of max-affine spline operators (MASOs), which provide a powerful portal through which to view and analyze their inner workings. For instance, conditioned on the input signal, the output of a MASO DN can be written as a simple affine transformation of the input. This implies that a DN constructs a set of signal-dependent, class-specific templates against which the signal is compared via a simple inner product; we explore the links to the classical theory of optimal classification via matched filters and the effects of data memorization. Going further, we propose a simple penalty term that can be added to the cost function of any DN learning algorithm to force the templates to be mutually orthogonal; this leads to significantly improved classification performance and reduced overfitting with no change to the DN architecture. The spline partition of the input signal space that is implicitly induced by a MASO directly links DNs to the theory of vector quantization (VQ) and $K$-means clustering, which opens up a new geometric avenue to study how DNs organize signals in a hierarchical fashion. To demonstrate the utility of the VQ interpretation, we develop and validate a new distance metric for signals and images that quantifies the difference between their VQ encodings. (This paper is a significantly expanded version of A Spline Theory of Deep Learning from ICML 2018.)
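
To make the "signal-dependent affine transformation" and "template orthogonality penalty" statements concrete, here is a minimal NumPy sketch on a toy two-layer ReLU network. It is an illustration, not the authors' code: the sizes `D`, `H`, `C` and the function names `forward`, `affine_map`, and `orthogonality_penalty` are arbitrary choices, and the penalty shown is one plausible instantiation of "mutually orthogonal templates" (summed squared inner products between distinct template rows), not necessarily the exact form used in the paper.

```python
# Minimal sketch (assumptions noted above) of two claims from the abstract on a
# toy 2-layer ReLU network:
#   (1) conditioned on the input x, the network output equals a signal-dependent
#       affine map f(x) = A_x x + b_x, whose rows act as per-class templates;
#   (2) a simple penalty that encourages those templates to be mutually orthogonal.

import numpy as np

rng = np.random.default_rng(0)
D, H, C = 8, 16, 3                       # input dim, hidden units, number of classes

W1, b1 = rng.standard_normal((H, D)), rng.standard_normal(H)
W2, b2 = rng.standard_normal((C, H)), rng.standard_normal(C)

def forward(x):
    """Plain 2-layer ReLU network: W2 relu(W1 x + b1) + b2."""
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

def affine_map(x):
    """Signal-dependent affine parameters (A_x, b_x) on the spline region containing x.

    ReLU either keeps or zeros each hidden unit, so on the region selected by the
    binary activation pattern q (the VQ code of x in the MASO view) the network is
    exactly affine: A_x = W2 diag(q) W1 and b_x = W2 diag(q) b1 + b2."""
    q = (W1 @ x + b1 > 0).astype(float)
    A_x = W2 @ (q[:, None] * W1)
    b_x = W2 @ (q * b1) + b2
    return A_x, b_x

def orthogonality_penalty(A_x):
    """Sum of squared inner products between distinct per-class templates (rows of A_x).

    One plausible instantiation of the abstract's orthogonality penalty; in training
    it would be added to the loss, whereas here it is only evaluated for illustration."""
    G = A_x @ A_x.T                      # Gram matrix of the templates
    off_diag = G - np.diag(np.diag(G))
    return np.sum(off_diag ** 2)

x = rng.standard_normal(D)
A_x, b_x = affine_map(x)

# (1) The composed ReLU network and its region-wise affine form agree exactly.
assert np.allclose(forward(x), A_x @ x + b_x)

# Each class score <A_x[c], x> + b_x[c] is a matched-filter-style inner product
# between the signal and that class's template.
print("scores:", A_x @ x + b_x)
print("orthogonality penalty:", orthogonality_penalty(A_x))
```

The same construction extends layer by layer to deeper piecewise-affine architectures: composing MASOs keeps the end-to-end map affine on each region of the induced input-space partition, which is what makes the template and VQ interpretations in the abstract possible.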
