On the Computational Efficiency of Training Neural Networks

It is well known that neural networks are computationally hard to train. On the other hand, in practice, modern-day neural networks are trained efficiently using SGD and a variety of tricks, including different activation functions (e.g., ReLU), over-specification (i.e., training networks that are larger than needed), and regularization. In this paper we revisit the computational complexity of training neural networks from a modern perspective. We provide both positive and negative results, some of which yield new provably efficient and practical algorithms for training certain types of neural networks.
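To make the practical setup referenced above concrete, here is a minimal sketch (not the paper's algorithm) of the kind of training it describes: an over-specified one-hidden-layer ReLU network fit with plain SGD on synthetic data. All names, widths, and hyperparameters are arbitrary illustrative choices.

```python
# Minimal illustrative sketch, assuming a squared-loss regression task:
# train an over-specified one-hidden-layer ReLU network with plain SGD.
# Hyperparameters (hidden width, learning rate, batch size) are arbitrary.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data generated by a small "teacher" ReLU network.
d, n = 10, 512                                   # input dim, sample count
X = rng.standard_normal((n, d))
W_teacher = rng.standard_normal((d, 4))
y = np.maximum(X @ W_teacher, 0.0).sum(axis=1, keepdims=True)

# Over-specified "student": far more hidden units than the teacher uses.
h = 128                                          # hidden width
W1 = rng.standard_normal((d, h)) * np.sqrt(2.0 / d)
b1 = np.zeros(h)
W2 = rng.standard_normal((h, 1)) * np.sqrt(2.0 / h)

lr, batch, epochs = 1e-3, 32, 200
for _ in range(epochs):
    perm = rng.permutation(n)
    for start in range(0, n, batch):
        idx = perm[start:start + batch]
        xb, yb = X[idx], y[idx]

        # Forward pass with ReLU activation.
        z = xb @ W1 + b1
        a = np.maximum(z, 0.0)
        err = a @ W2 - yb                        # squared-loss residual

        # Backward pass: manual gradients for both layers.
        grad_W2 = a.T @ err / len(idx)
        grad_z = (err @ W2.T) * (z > 0)          # ReLU derivative
        grad_W1 = xb.T @ grad_z / len(idx)
        grad_b1 = grad_z.mean(axis=0)

        # Plain SGD update.
        W1 -= lr * grad_W1
        b1 -= lr * grad_b1
        W2 -= lr * grad_W2

mse = float(np.mean((np.maximum(X @ W1 + b1, 0.0) @ W2 - y) ** 2))
print(f"final training MSE: {mse:.4f}")
```

The sketch only illustrates the empirical recipe (SGD, ReLU activations, over-specification); the paper's formal results concern when such training can or cannot be done efficiently with provable guarantees.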
