Efficient Learning of CNNs using Patch Based Features

Recent work has demonstrated the effectiveness of using patch-based representations when learning from image data. Here we provide theoretical support for this observation by showing that a simple semi-supervised algorithm that uses patch statistics can efficiently learn labels produced by a one-hidden-layer Convolutional Neural Network (CNN). Since CNNs are known to be computationally hard to learn in the worst case, our analysis holds under certain distributional assumptions. We show that these assumptions are necessary and sufficient for our results to hold. We verify that the distributional assumptions hold on real-world data by experimenting on the CIFAR-10 dataset, and find that the analyzed algorithm outperforms a vanilla one-hidden-layer CNN. Finally, we demonstrate that running the algorithm in a layer-by-layer fashion yields a deep model with further improvements, suggesting that this method provides insight into the behavior of deep CNNs.
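
The pipeline described above can be pictured as a two-stage recipe: an unsupervised step that summarizes patch statistics from unlabeled images, followed by a supervised linear fit on the resulting features. The sketch below is a minimal, hypothetical illustration of that recipe, assuming patch statistics are captured by k-means centroids and each image is encoded by a histogram of its patches' nearest centroids; the function names, hyperparameters (patch size, stride, number of clusters), and the use of scikit-learn are illustrative assumptions, not the paper's exact algorithm.

```python
# Hypothetical sketch of a patch-statistics pipeline (all names and
# hyperparameters are illustrative assumptions, not the paper's procedure):
#   1. extract patches from unlabeled images and cluster them with k-means,
#   2. encode each image by the histogram of its patches' nearest centroids,
#   3. fit a linear classifier on the (small) labeled subset of encoded images.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression


def extract_patches(images, patch_size=6, stride=2):
    """Return all patches of shape (patch_size, patch_size, C) as flat vectors.

    images: array of shape (n_images, height, width, channels).
    Output: array of shape (n_images, n_patches_per_image, patch_dim).
    """
    n, h, w, _ = images.shape
    patches = []
    for i in range(0, h - patch_size + 1, stride):
        for j in range(0, w - patch_size + 1, stride):
            patches.append(
                images[:, i:i + patch_size, j:j + patch_size, :].reshape(n, -1)
            )
    return np.stack(patches, axis=1)


def encode_by_centroid_histogram(patches, kmeans):
    """Encode each image by the normalized histogram of nearest patch centroids."""
    n, p, d = patches.shape
    assignments = kmeans.predict(patches.reshape(n * p, d)).reshape(n, p)
    k = kmeans.n_clusters
    hist = np.zeros((n, k))
    for img_idx in range(n):
        hist[img_idx] = np.bincount(assignments[img_idx], minlength=k) / p
    return hist


def patch_based_semi_supervised(unlabeled, labeled_x, labeled_y, k=256):
    # Unsupervised step: cluster patches drawn from the unlabeled images.
    unlabeled_patches = extract_patches(unlabeled)
    flat = unlabeled_patches.reshape(-1, unlabeled_patches.shape[-1])
    kmeans = KMeans(n_clusters=k, n_init=10).fit(flat)

    # Supervised step: linear classifier over the patch-statistic features.
    features = encode_by_centroid_histogram(extract_patches(labeled_x), kmeans)
    clf = LogisticRegression(max_iter=1000).fit(features, labeled_y)
    return kmeans, clf
```

In the layer-by-layer variant mentioned above, one could (speculatively) treat the per-patch cluster encodings as a new feature map and repeat the same clustering-plus-encoding step on top of it, building depth one stage at a time.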
