Learning Halfspaces and Neural Networks with Random Initialization

We study non-convex empirical risk minimization for learning halfspaces and neural networks. For loss functions that are $L$-Lipschitz continuous, we present algorithms that learn halfspaces and multi-layer neural networks with excess risk at most $\epsilon$, for any $\epsilon>0$. The time complexity is polynomial in the input dimension $d$ and the sample size $n$, but exponential in the quantity $(L/\epsilon^2)\log(L/\epsilon)$. These algorithms run multiple rounds of random initialization, each followed by arbitrary optimization steps. We further show that if the data is separable by some neural network with constant margin $\gamma>0$, then there is a polynomial-time algorithm for learning a neural network that separates the training data with margin $\Omega(\gamma)$. As a consequence, the algorithm achieves any desired generalization error $\epsilon>0$ with ${\rm poly}(d,1/\epsilon)$ sample and time complexity. We establish the same learnability result when the labels are randomly flipped with probability $\eta<1/2$.
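To make the restart-and-optimize scheme concrete, here is a minimal sketch for the halfspace case: it draws several random unit-norm initializations, runs plain gradient descent on a bounded Lipschitz surrogate loss, and keeps the iterate with the smallest empirical risk. The surrogate loss, step size, restart count, and helper names (`train_halfspace`, `sigmoid_loss`) are illustrative assumptions for this sketch, not the paper's actual construction or analysis.

```python
# Minimal sketch (not the paper's construction): multiple rounds of random
# initialization, each followed by gradient descent on an L-Lipschitz
# surrogate loss for a halfspace; the best iterate is kept.
import numpy as np

def sigmoid_loss(margins):
    # 1/(1+e^m) is a bounded, Lipschitz surrogate for the 0-1 loss.
    return 1.0 / (1.0 + np.exp(margins))

def sigmoid_loss_grad(margins):
    s = 1.0 / (1.0 + np.exp(margins))
    return -s * (1.0 - s)  # derivative of sigmoid_loss w.r.t. the margin

def empirical_risk(w, X, y):
    return sigmoid_loss(y * (X @ w)).mean()

def train_halfspace(X, y, n_restarts=20, n_steps=500, lr=0.1, seed=0):
    """Run several rounds of random initialization, each followed by
    gradient descent, and return the weight vector with the lowest
    empirical surrogate risk."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    best_w, best_risk = None, np.inf
    for _ in range(n_restarts):
        w = rng.normal(size=d)
        w /= np.linalg.norm(w)                 # random unit-norm initialization
        for _ in range(n_steps):
            margins = y * (X @ w)
            grad = (X * (y * sigmoid_loss_grad(margins))[:, None]).mean(axis=0)
            w -= lr * grad
            w /= max(1.0, np.linalg.norm(w))   # project back to the unit ball
        risk = empirical_risk(w, X, y)
        if risk < best_risk:
            best_w, best_risk = w, risk
    return best_w, best_risk

if __name__ == "__main__":
    # Toy separable data: labels given by a random ground-truth halfspace.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(500, 10))
    w_star = rng.normal(size=10)
    y = np.sign(X @ w_star)
    w_hat, risk = train_halfspace(X, y)
    print(f"best empirical surrogate risk over restarts: {risk:.4f}")
```

The restart loop is the only structure the abstract commits to; the inner optimizer is "arbitrary," so gradient descent here is just one convenient instantiation.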
