Depth Separations in Neural Networks: What is Actually Being Separated?

Existing depth separation results for constant-depth networks essentially show that certain radial functions in $\mathbb{R}^d$, which can be easily approximated with depth $3$ networks, cannot be approximated by depth $2$ networks, even up to constant accuracy, unless their size is exponential in $d$. However, the functions used to demonstrate this are rapidly oscillating, with a Lipschitz parameter scaling polynomially with the dimension $d$ (or equivalently, by scaling the function, the hardness result applies to $\mathcal{O}(1)$-Lipschitz functions only when the target accuracy $\epsilon$ is at most $\text{poly}(1/d)$). In this paper, we study whether such depth separations might still hold in the natural setting of $\mathcal{O}(1)$-Lipschitz radial functions, when $\epsilon$ does not scale with $d$. Perhaps surprisingly, we show that the answer is negative: In contrast to the intuition suggested by previous work, it \emph{is} possible to approximate $\mathcal{O}(1)$-Lipschitz radial functions with depth $2$, size $\text{poly}(d)$ networks, for every constant $\epsilon$. We complement this by showing that approximating such functions is also possible with depth $2$, size $\text{poly}(1/\epsilon)$ networks, for every constant $d$. Finally, we show that it is not possible to have polynomial dependence on both $d$ and $1/\epsilon$ simultaneously. Overall, our results indicate that in order to show depth separations for expressing $\mathcal{O}(1)$-Lipschitz functions with constant accuracy -- if at all possible -- one would need fundamentally different techniques than existing ones in the literature.
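To make the three positive/negative statements in the abstract easier to compare side by side, here is a schematic restatement in LaTeX. The abstract does not specify the approximation metric or the input distribution; the expected squared error over a distribution $\mu$ used below is an assumed placeholder for the paper's actual error measure, not its formal definition.

```latex
% Schematic summary of the abstract's results (not the paper's formal theorems).
% f is an O(1)-Lipschitz radial function on R^d, i.e. f(x) = g(||x||) with g Lipschitz;
% "N approximates f to accuracy eps" is written as an expected squared error over an
% (unspecified, assumed) input distribution mu.
\begin{itemize}
  \item[(1)] \textbf{Fixed accuracy, growing dimension:} for every constant $\epsilon > 0$,
    there is a depth-$2$ network $N$ of size $\mathrm{poly}(d)$ with
    $\mathbb{E}_{x \sim \mu}\bigl[(N(x) - f(x))^2\bigr] \le \epsilon$.
  \item[(2)] \textbf{Fixed dimension, shrinking accuracy:} for every constant $d$,
    there is a depth-$2$ network of size $\mathrm{poly}(1/\epsilon)$ achieving accuracy $\epsilon$.
  \item[(3)] \textbf{No joint polynomial rate:} no depth-$2$ construction of size
    $\mathrm{poly}(d, 1/\epsilon)$ achieves accuracy $\epsilon$ for all $d$ and $\epsilon$ simultaneously.
\end{itemize}
```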
