A representer theorem for deep neural networks

We propose to optimize the activation functions of a deep neural network by adding a corresponding functional regularization to the cost function. We justify the use of a second-order total-variation criterion. This allows us to derive a general representer theorem for deep neural networks that makes a direct connection with splines and sparsity. Specifically, we show that the optimal network configuration can be achieved with activation functions that are nonuniform linear splines with adaptive knots. The bottom line is that the action of each neuron is encoded by a spline whose parameters (including the number of knots) are optimized during the training procedure. The scheme results in a computational structure that is compatible with the existing deep-ReLU and MaxOut architectures. It also suggests novel optimization challenges, while making the link with $\ell_1$ minimization and sparsity-promoting techniques explicit.
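The representer theorem stated above says that each neuron's optimal activation takes the form $\sigma(x) = b_1 + b_2 x + \sum_{k=1}^{K} a_k\,(x-\tau_k)_+$, an affine term (the null space of the regularizer) plus a finite sum of shifted ReLUs, whose second-order total variation reduces to $\mathrm{TV}^{(2)}(\sigma) = \|a\|_1$. Below is a minimal sketch of such a learnable spline activation in PyTorch. It is illustrative only: the module name `SplineActivation`, the number of knots, and the fixed knot grid are assumptions for concreteness (the theorem itself allows adaptive knot locations), not an implementation from the paper.

```python
import torch
import torch.nn as nn

class SplineActivation(nn.Module):
    """Learnable piecewise-linear activation:
        sigma(x) = b0 + b1 * x + sum_k a_k * max(0, x - tau_k).
    Its second-order total variation is TV2(sigma) = sum_k |a_k|,
    so an l1 penalty on `a` promotes activations with few active knots.
    """
    def __init__(self, num_knots: int = 21, x_range: float = 3.0):
        super().__init__()
        # Fixed knot grid for simplicity; the representer theorem
        # allows the knots themselves to be adapted during training.
        self.register_buffer("tau", torch.linspace(-x_range, x_range, num_knots))
        self.a = nn.Parameter(torch.zeros(num_knots))    # slope jump at each knot
        self.b = nn.Parameter(torch.tensor([0.0, 1.0]))  # affine part (regularizer null space)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (..., 1) minus (K,) broadcasts to (..., K): one ReLU branch per knot.
        relu_terms = torch.relu(x.unsqueeze(-1) - self.tau)
        return self.b[0] + self.b[1] * x + relu_terms @ self.a

    def tv2(self) -> torch.Tensor:
        # Sparsity-promoting regularizer to add to the training loss.
        return self.a.abs().sum()
```

In training, one would add $\lambda \sum_{\text{neurons}} \mathrm{TV}^{(2)}(\sigma)$ to the data-fidelity cost; the $\ell_1$ penalty then drives most coefficients $a_k$ to zero, so the learned spline retains only a few knots, which is the sparsity mechanism the abstract refers to.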
