Measuring Complexity of Learning Schemes Using Hessian-Schatten Total-Variation

In this paper, we introduce the Hessian-Schatten total-variation (HTV), a novel seminorm that quantifies the total “rugosity” of multivariate functions. Our motivation for defining the HTV is to assess the complexity of supervised learning schemes. We start by specifying the adequate matrix-valued Banach spaces, which are equipped with suitable classes of mixed norms. We then show that the HTV is invariant to rotations, scalings, and translations. Moreover, its minimum value is achieved for linear mappings, which supports the common intuition that linear regression is the least complex learning model. Next, we present closed-form expressions for computing the HTV of two general classes of functions. The first is the class of Sobolev functions with a certain degree of regularity, for which we show that the HTV coincides with the Hessian-Schatten seminorm that is sometimes used as a regularizer for image reconstruction. The second is the class of continuous and piecewise-linear (CPWL) functions. In this case, we show that the HTV reflects the total change in slopes between linear regions that share a common facet. Hence, it can be viewed as a convex (ℓ1-type) relaxation of the number of linear regions (ℓ0-type) of CPWL mappings. Finally, we illustrate the use of our proposed seminorm with some concrete examples.
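As a rough sketch of the two closed-form cases mentioned above (the notation here is ours, not taken verbatim from the paper): for a sufficiently regular Sobolev function f on R^d, the HTV can be written as the integrated Schatten p-norm of the Hessian,

    \mathrm{HTV}_p(f) = \int_{\mathbb{R}^d} \big\| \nabla^2 f(\mathbf{x}) \big\|_{S_p} \, \mathrm{d}\mathbf{x},

while for a CPWL function whose linear regions have gradients a_k, it reduces to a weighted sum of slope changes across shared facets,

    \mathrm{HTV}(f) = \sum_{k} \big\| \mathbf{a}_{k_1} - \mathbf{a}_{k_2} \big\|_2 \, \mathcal{H}^{d-1}(F_k),

where F_k denotes the facet shared by the two linear regions with gradients a_{k_1} and a_{k_2}, and \mathcal{H}^{d-1} is the (d-1)-dimensional Hausdorff measure. The second expression makes the ℓ1-versus-ℓ0 analogy explicit: it penalizes the magnitude of each slope change across a facet rather than merely counting the linear regions.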
