Why Are Convolutional Nets More Sample-Efficient than Fully-Connected Nets?

Convolutional neural networks often dominate their fully-connected counterparts in generalization performance, especially on image classification tasks. This is often attributed to a "better inductive bias," but the claim has not been made mathematically rigorous: a sufficiently wide fully-connected net can always simulate a convolutional net, so any provable separation must take the training algorithm into account. The current work describes a natural task on which a provable sample complexity gap can be shown for standard training algorithms. We construct a single natural distribution on $\mathbb{R}^d\times\{\pm 1\}$ on which any orthogonal-invariant algorithm (i.e., fully-connected networks trained with most gradient-based methods from Gaussian initialization) requires $\Omega(d^2)$ samples to generalize, while $O(1)$ samples suffice for convolutional architectures. Furthermore, we exhibit a single target function for which learning over all possible distributions leads to an $O(1)$ vs.\ $\Omega(d^2/\varepsilon)$ gap. The proof relies on the fact that SGD on a fully-connected network is orthogonally equivariant. Similar results hold for $\ell_2$ regression and for adaptive training algorithms such as Adam and AdaGrad, which are only permutation equivariant.
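The following is a minimal NumPy sketch (not taken from the paper) illustrating the orthogonal-equivariance property the proof relies on: training a two-layer fully-connected ReLU net with gradient descent from an initialization $W_0$ on data $X$ produces, point for point, the same predictions as training from the rotated initialization $W_0 Q^\top$ on the rotated data $X Q^\top$, evaluated at the correspondingly rotated test point. All names and hyperparameters here are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n, lr, steps = 8, 16, 32, 0.1, 200

# Random data, labels, and a random orthogonal matrix Q (QR of a Gaussian matrix).
X = rng.normal(size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))

def train(X, y, W, a):
    """Full-batch gradient descent on squared loss for f(x) = a^T relu(W x)."""
    for _ in range(steps):
        H = np.maximum(X @ W.T, 0.0)          # (n, m) hidden activations
        err = H @ a - y                        # (n,) residuals
        grad_a = H.T @ err / n
        grad_W = ((err[:, None] * (X @ W.T > 0)) * a).T @ X / n
        W = W - lr * grad_W
        a = a - lr * grad_a
    return W, a

W0 = rng.normal(size=(m, d))   # Gaussian init: its law is rotation-invariant
a0 = rng.normal(size=m)

W1, a1 = train(X, y, W0, a0)               # original run
W2, a2 = train(X @ Q.T, y, W0 @ Q.T, a0)   # rotated data, rotated init

x_test = rng.normal(size=d)
p1 = a1 @ np.maximum(W1 @ x_test, 0.0)
p2 = a2 @ np.maximum(W2 @ (Q @ x_test), 0.0)
print(p1, p2)                  # agree up to floating-point error
assert np.allclose(p1, p2)
```

Because the Gaussian initialization is rotation-invariant in distribution, this equivariance means the learner's output distribution cannot depend on the orientation of the data, which is what forces the $\Omega(d^2)$ sample requirement on the constructed distribution.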
