Depth Separations in Neural Networks: What is Actually Being Separated?

Existing depth separation results for constant-depth networks essentially show that certain radial functions in $\mathbb{R}^d$, which can be easily approximated with depth $3$ networks, cannot be approximated by depth $2$ networks, even up to constant accuracy, unless their size is exponential in $d$. However, the functions used to demonstrate this are rapidly oscillating, with a Lipschitz parameter scaling polynomially with the dimension $d$ (or equivalently, by scaling the function, the hardness result applies to $\mathcal{O}(1)$-Lipschitz functions only when the target accuracy $\epsilon$ is at most $\text{poly}(1/d)$). In this paper, we study whether such depth separations might still hold in the natural setting of $\mathcal{O}(1)$-Lipschitz radial functions, when $\epsilon$ does not scale with $d$. Perhaps surprisingly, we show that the answer is negative: In contrast to the intuition suggested by previous work, it \emph{is} possible to approximate $\mathcal{O}(1)$-Lipschitz radial functions with depth $2$, size $\text{poly}(d)$ networks, for every constant $\epsilon$. We complement this by showing that approximating such functions is also possible with depth $2$, size $\text{poly}(1/\epsilon)$ networks, for every constant $d$. Finally, we show that it is not possible to have polynomial dependence on both $d$ and $1/\epsilon$ simultaneously. Overall, our results indicate that in order to show depth separations for expressing $\mathcal{O}(1)$-Lipschitz functions with constant accuracy -- if at all possible -- one would need fundamentally different techniques than existing ones in the literature.
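To make the three positive/negative statements in the abstract easier to compare side by side, here is a schematic restatement in LaTeX. The abstract does not specify the approximation metric or the input distribution; the expected squared error over a distribution $\mu$ used below is an assumed placeholder for the paper's actual error measure, not its formal definition.

```latex
% Schematic summary of the abstract's results (not the paper's formal theorems).
% f is an O(1)-Lipschitz radial function on R^d, i.e. f(x) = g(||x||) with g Lipschitz;
% "N approximates f to accuracy eps" is written as an expected squared error over an
% (unspecified, assumed) input distribution mu.
\begin{itemize}
  \item[(1)] \textbf{Fixed accuracy, growing dimension:} for every constant $\epsilon > 0$,
    there is a depth-$2$ network $N$ of size $\mathrm{poly}(d)$ with
    $\mathbb{E}_{x \sim \mu}\bigl[(N(x) - f(x))^2\bigr] \le \epsilon$.
  \item[(2)] \textbf{Fixed dimension, shrinking accuracy:} for every constant $d$,
    there is a depth-$2$ network of size $\mathrm{poly}(1/\epsilon)$ achieving accuracy $\epsilon$.
  \item[(3)] \textbf{No joint polynomial rate:} no depth-$2$ construction of size
    $\mathrm{poly}(d, 1/\epsilon)$ achieves accuracy $\epsilon$ for all $d$ and $\epsilon$ simultaneously.
\end{itemize}
```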
