Learning Activation Functions in Deep (Spline) Neural Networks

We develop an efficient computational solution to train deep neural networks (DNNs) with free-form activation functions. To make the problem well-posed, we augment the cost functional of the DNN with an appropriate shape regularization: the sum of the second-order total variations of the trainable nonlinearities. The representer theorem for DNNs tells us that the optimal activation functions are adaptive piecewise-linear splines, which allows us to recast the problem as a parametric optimization. The difficulty is that the corresponding basis functions (ReLUs) are poorly conditioned and that their number and positions must also be determined. We circumvent this by encoding the activation functions in an equivalent B-spline basis and by expressing the regularization as an $\ell_1$-penalty. This results in parametric activation modules that can be implemented and optimized efficiently on standard development platforms. We present experimental results that demonstrate the benefit of our approach.
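Concretely, the augmented objective has the form of a data-fidelity term plus $\lambda$ times the sum of the second-order total variations $\mathrm{TV}^{(2)}(\sigma)$ of all trainable activations $\sigma$; for a linear spline on a uniform grid, $\mathrm{TV}^{(2)}(\sigma)$ reduces to the $\ell_1$-norm of the second finite differences of its B-spline coefficients. Below is a minimal PyTorch-style sketch of such an activation module; the class name, the symmetric uniform knot grid, the ReLU initialization, and the simplified boundary handling are our own illustrative assumptions rather than the paper's reference implementation.

```python
import torch
import torch.nn as nn


class BSplineActivation(nn.Module):
    """Pointwise activation expanded in a uniform linear B-spline (hat-function) basis.

    sigma(x) = sum_k c_k * B1((x - t_k) / h),   B1(t) = max(0, 1 - |t|),
    with equally spaced knots t_k and trainable coefficients c_k.
    Grid layout, initialization, and boundary handling are illustrative choices.
    """

    def __init__(self, num_knots: int = 21, grid_range: float = 3.0):
        super().__init__()
        assert num_knots >= 3 and num_knots % 2 == 1  # odd count so that 0 is a knot
        self.h = 2.0 * grid_range / (num_knots - 1)   # knot spacing
        knots = torch.linspace(-grid_range, grid_range, num_knots)
        self.register_buffer("knots", knots)
        # Start from ReLU: linear-spline interpolation of ReLU samples at the knots
        # reproduces ReLU exactly inside the grid.
        self.coeffs = nn.Parameter(torch.relu(knots).clone())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Evaluate all hat functions at every input (simple O(num_knots) version);
        # inputs are assumed to stay within [-grid_range, grid_range].
        t = (x.unsqueeze(-1) - self.knots) / self.h   # shape (..., num_knots)
        basis = torch.clamp(1.0 - t.abs(), min=0.0)   # linear B-spline values
        return (basis * self.coeffs).sum(dim=-1)

    def tv2(self) -> torch.Tensor:
        # Second-order total variation of the linear spline = l1 norm of the second
        # finite differences of its B-spline coefficients (up to the factor 1/h),
        # which is what turns the shape regularizer into an l1 penalty.
        d2 = self.coeffs[2:] - 2.0 * self.coeffs[1:-1] + self.coeffs[:-2]
        return d2.abs().sum() / self.h
```

Such a module would replace a fixed nonlinearity in each layer, and the training loss would then be the task loss plus $\lambda$ times the sum of `tv2()` over all spline activations in the network.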
