Mad Max: Affine Spline Insights Into Deep Learning

We build a rigorous bridge between deep networks (DNs) and approximation theory via spline functions and operators. Our key result is that a large class of DNs can be written as a composition of max-affine spline operators (MASOs), which provide a powerful portal through which to view and analyze their inner workings. For instance, conditioned on the input signal, the output of a MASO DN can be written as a simple affine transformation of the input. This implies that a DN constructs a set of signal-dependent, class-specific templates against which the signal is compared via a simple inner product; we explore the links to the classical theory of optimal classification via matched filters and the effects of data memorization. Going further, we propose a simple penalty term that can be added to the cost function of any DN learning algorithm to force the templates to be mutually orthogonal; this leads to significantly improved classification performance and reduced overfitting with no change to the DN architecture. The spline partition of the input signal space that is implicitly induced by a MASO directly links DNs to the theory of vector quantization (VQ) and $K$-means clustering, which opens up a new geometric avenue to study how DNs organize signals in a hierarchical fashion. To demonstrate the utility of the VQ interpretation, we develop and validate a new distance metric for signals and images that quantifies the difference between their VQ encodings. (This paper is a significantly expanded version of A Spline Theory of Deep Learning from ICML 2018.)
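
To make the "signal-dependent affine transformation" and "template orthogonality penalty" statements concrete, here is a minimal NumPy sketch on a toy two-layer ReLU network. It is an illustration, not the authors' code: the sizes `D`, `H`, `C` and the function names `forward`, `affine_map`, and `orthogonality_penalty` are arbitrary choices, and the penalty shown is one plausible instantiation of "mutually orthogonal templates" (summed squared inner products between distinct template rows), not necessarily the exact form used in the paper.

```python
# Minimal sketch (assumptions noted above) of two claims from the abstract on a
# toy 2-layer ReLU network:
#   (1) conditioned on the input x, the network output equals a signal-dependent
#       affine map f(x) = A_x x + b_x, whose rows act as per-class templates;
#   (2) a simple penalty that encourages those templates to be mutually orthogonal.

import numpy as np

rng = np.random.default_rng(0)
D, H, C = 8, 16, 3                       # input dim, hidden units, number of classes

W1, b1 = rng.standard_normal((H, D)), rng.standard_normal(H)
W2, b2 = rng.standard_normal((C, H)), rng.standard_normal(C)

def forward(x):
    """Plain 2-layer ReLU network: W2 relu(W1 x + b1) + b2."""
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

def affine_map(x):
    """Signal-dependent affine parameters (A_x, b_x) on the spline region containing x.

    ReLU either keeps or zeros each hidden unit, so on the region selected by the
    binary activation pattern q (the VQ code of x in the MASO view) the network is
    exactly affine: A_x = W2 diag(q) W1 and b_x = W2 diag(q) b1 + b2."""
    q = (W1 @ x + b1 > 0).astype(float)
    A_x = W2 @ (q[:, None] * W1)
    b_x = W2 @ (q * b1) + b2
    return A_x, b_x

def orthogonality_penalty(A_x):
    """Sum of squared inner products between distinct per-class templates (rows of A_x).

    One plausible instantiation of the abstract's orthogonality penalty; in training
    it would be added to the loss, whereas here it is only evaluated for illustration."""
    G = A_x @ A_x.T                      # Gram matrix of the templates
    off_diag = G - np.diag(np.diag(G))
    return np.sum(off_diag ** 2)

x = rng.standard_normal(D)
A_x, b_x = affine_map(x)

# (1) The composed ReLU network and its region-wise affine form agree exactly.
assert np.allclose(forward(x), A_x @ x + b_x)

# Each class score <A_x[c], x> + b_x[c] is a matched-filter-style inner product
# between the signal and that class's template.
print("scores:", A_x @ x + b_x)
print("orthogonality penalty:", orthogonality_penalty(A_x))
```

The same construction extends layer by layer to deeper piecewise-affine architectures: composing MASOs keeps the end-to-end map affine on each region of the induced input-space partition, which is what makes the template and VQ interpretations in the abstract possible.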
