Comparing the Parameter Complexity of Hypernetworks and the Embedding-Based Alternative

In the context of learning to map an input $I$ to a function $h_I:\mathcal{X}\to \mathbb{R}$, we compare two alternative methods: (i) an embedding-based method, which learns a fixed function in which $I$ is encoded as a conditioning signal $e(I)$, so that the learned function takes the form $h_I(x) = q(x,e(I))$, and (ii) hypernetworks, in which the weights $\theta_I$ of the function $h_I(x) = g(x;\theta_I)$ are produced by a hypernetwork $f$ as $\theta_I=f(I)$. We extend the theory of~\cite{devore} and provide a lower bound on the complexity of neural networks as function approximators, i.e., on the number of trainable parameters they require. This extension eliminates the requirement that the approximation method be robust. Our results are then used to compare the complexities of $q$ and $g$, showing that, under certain conditions and when the functions $e$ and $f$ are allowed to be as large as we wish, $g$ can be smaller than $q$ by orders of magnitude. In addition, we show that, under typical assumptions on the function to be approximated, the overall number of trainable parameters in a hypernetwork is smaller by orders of magnitude than that of a standard neural network or of an embedding-based method.
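
To make the two parameterizations concrete, here is a minimal sketch in plain NumPy. It is an illustration under assumed toy settings, not code from the paper: the dimensions `d_x`, `d_e`, `d_h` and the two-layer architectures are arbitrary choices, and only the notation ($q$, $e(I)$, $g$, $\theta_I=f(I)$) mirrors the text above.

```python
# Minimal sketch (assumed toy sizes, not from the paper) contrasting the two
# conditioning schemes: q(x, e(I)) vs. g(x; theta_I) with theta_I = f(I).
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

d_x, d_e, d_h = 3, 64, 32  # input dim, embedding dim, hidden width (illustrative)

# (i) Embedding-based: one fixed two-layer network q; the conditioning code
#     e(I) is concatenated to the input, so the first layer is (d_x + d_e) wide.
W1 = rng.standard_normal((d_h, d_x + d_e)); b1 = np.zeros(d_h)
W2 = rng.standard_normal((1, d_h));         b2 = np.zeros(1)

def q(x, e_I):
    h = relu(W1 @ np.concatenate([x, e_I]) + b1)
    return W2 @ h + b2

# (ii) Hypernetwork: the primary network g sees only x; its weights theta_I
#      are supplied externally (in the paper, by a hypernetwork f applied to I).
def g(x, theta_I):
    V1, c1, V2, c2 = theta_I
    return V2 @ relu(V1 @ x + c1) + c2

# Stand-in for f(I): here the weights are simply sampled for illustration.
theta_I = (rng.standard_normal((d_h, d_x)), np.zeros(d_h),
           rng.standard_normal((1, d_h)), np.zeros(1))

x, e_I = rng.standard_normal(d_x), rng.standard_normal(d_e)
print(q(x, e_I), g(x, theta_I))  # both map x to a scalar prediction

# Trainable parameters of the two primary networks being compared:
params_q = W1.size + b1.size + W2.size + b2.size  # grows linearly with d_e
params_g = d_h * d_x + d_h + d_h + 1              # independent of d_e
print(params_q, params_g)                         # 2209 vs. 161 with these sizes
```

In this toy setting, enlarging the conditioning code (`d_e`) inflates the parameter count of $q$, while $g$ consumes only $x$ and is unaffected; this is the mechanism behind the complexity gap stated above, and the second result additionally accounts for the trainable parameters of $e$ and $f$ when comparing the two approaches overall.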

[1] G. Burton. Sobolev Spaces, 2013.

[2] David Duvenaud et al. Stochastic Hyperparameter Optimization through Hypernetworks, 2018, arXiv.

[3] R. Meir et al. On the Approximation of Functional Classes Equipped with a Uniform Measure Using Ridge Functions, 1999.

[4] Kurt Hornik et al. Approximation capabilities of multilayer feedforward networks, 1991, Neural Networks.

[5] Timo Aila et al. A Style-Based Generator Architecture for Generative Adversarial Networks, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[6] Yee Whye Teh et al. Multiplicative Interactions and Where to Find Them, 2020, ICLR.

[7] Raquel Urtasun et al. Graph HyperNetworks for Neural Architecture Search, 2018, ICLR.

[8] Benjamin F. Grewe et al. Continual learning with hypernetworks, 2019, ICLR.

[9] Lior Wolf et al. Molecule Property Prediction and Classification with Graph Hypernetworks, 2019, arXiv.

[10] Charles Fefferman et al. Recovering a Feed-Forward Net From Its Output, 1993, NIPS.

[11] Lior Wolf et al. A Dynamic Convolutional Layer for short range weather prediction, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12] Liwei Wang et al. The Expressive Power of Neural Networks: A View from the Width, 2017, NIPS.

[13] Takashi Matsubara et al. Hypernetwork-based Implicit Posterior Estimation and Model Averaging of CNN, 2018, ACML.

[14] Luc Van Gool et al. Dynamic Filter Networks, 2016, NIPS.

[15] Mark Sellke et al. Approximating Continuous Functions by ReLU Nets of Minimal Width, 2017, arXiv.

[16] Héctor J. Sussmann et al. Uniqueness of the weights for minimal feedforward nets with a given input-output map, 1992, Neural Networks.

[17] Jian Sun et al. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, 2015 IEEE International Conference on Computer Vision (ICCV).

[18] V. Maiorov. On Best Approximation by Ridge Functions, 1999.

[19] Stefanie Jegelka et al. ResNet with one-neuron hidden layers is a Universal Approximator, 2018, NeurIPS.

[20] Yee Whye Teh et al. Conditional Neural Processes, 2018, ICML.

[21] George Cybenko et al. Approximation by superpositions of a sigmoidal function, 1989, Math. Control. Signals Syst.

[22] Hod Lipson et al. Principled Weight Initialization for Hypernetworks, 2020, ICLR.

[23] A. Pinkus. n-Widths in Approximation Theory, 1985.

[24] H. N. Mhaskar et al. Neural Networks for Optimal Approximation of Smooth and Analytic Functions, 1996, Neural Computation.

[25] Bastian Goldlücke et al. Variational Analysis, 2014, Computer Vision: A Reference Guide.

[26] R. DeVore et al. Optimal nonlinear approximation, 1989.

[27] Karol Borsuk. Drei Sätze über die n-dimensionale euklidische Sphäre, 1933.

[28] Lior Wolf et al. Deep Meta Functionals for Shape Representation, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[29] Tomaso A. Poggio et al. When and Why Are Deep Networks Better Than Shallow Ones?, 2017, AAAI.

[30] Luca Bertinetto et al. Learning feed-forward one-shot learners, 2016, NIPS.

[31] Alexandre Lacoste et al. Bayesian Hypernetworks, 2017, arXiv.

[32] W. Groß. Grundzüge der Mengenlehre, 1915.

[33] Verner Vlačić. Neural Network Identifiability for a Family of Sigmoidal Nonlinearities, 2019, Constructive Approximation.

[34] Horst Bischof et al. Conditioned Regression Models for Non-blind Single Image Super-Resolution, 2015 IEEE International Conference on Computer Vision (ICCV).

[35] Theodore Lim et al. SMASH: One-Shot Model Architecture Search through HyperNetworks, 2017, ICLR.

[36] Eduardo D. Sontag et al. Uniqueness of Weights for Neural Networks, 1993.

[37] Christopher T. J. Dodson et al. A User’s Guide to Algebraic Topology, 1996.

[38] Ohad Shamir et al. Depth-Width Tradeoffs in Approximating Natural Functions with Neural Networks, 2016, ICML.

[39] Heiga Zen et al. WaveNet: A Generative Model for Raw Audio, 2016, SSW.

[40] Allan Pinkus et al. Lower bounds for approximation by MLP neural networks, 1999, Neurocomputing.