On the Learnability of Fully-Connected Neural Networks

Despite the empirical success of deep neural networks, there is limited theoretical understanding of the learnability of these models with respect to polynomial-time algorithms. In this paper, we characterize the learnability of fully-connected neural networks via both positive and negative results. We focus on ℓ1-regularized networks, where the ℓ1-norm of the incoming weights of every neuron is assumed to be bounded by a constant B > 0. Our first result shows that such networks are properly learnable in poly(n, d, exp(1/ε)) time, where n and d are the sample size and the input dimension, and ε > 0 is the gap to optimality. The bound is achieved by repeatedly sampling over a low-dimensional manifold so as to ensure approximate optimality, while avoiding the exp(d) cost of exhaustively searching over the parameter space. We also establish a hardness result showing that the exponential dependence on 1/ε is unavoidable unless RP = NP. Our second result shows that the exponential dependence on 1/ε can be avoided by exploiting the underlying structure of the data distribution. In particular, if the positive and negative examples can be separated with margin γ > 0 by an unknown neural network, then the network can be learned in poly(n, d, 1/ε) time. The bound is achieved by an ensemble method which uses the first algorithm as a weak learner. We further show that the separability assumption can be weakened to tolerate noisy labels. Finally, we show that the exponential dependence on 1/γ is unimprovable under a certain cryptographic assumption.
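The model class described above admits a compact illustration. Below is a minimal Python sketch, not taken from the paper, of the per-neuron ℓ1 constraint: a fully-connected network is represented as a list of weight matrices, and helpers check (or enforce by rescaling) that the ℓ1-norm of each neuron's incoming weights is at most B. The names check_l1_constraint and project_l1, the layer-list representation, and the rescaling-based projection are illustrative assumptions, not the paper's algorithm.

```python
# Sketch of the constraint defining "l1-regularized networks" in the abstract:
# the l1-norm of the incoming weights of every neuron is bounded by B > 0.
# Helper names and the rescaling step are illustrative assumptions.
import numpy as np


def check_l1_constraint(layers, B):
    """Return True if every neuron's incoming-weight l1-norm is at most B.

    `layers` is a list of weight matrices; row i of a matrix holds the
    incoming weights of neuron i in that layer.
    """
    return all(np.all(np.abs(W).sum(axis=1) <= B) for W in layers)


def project_l1(layers, B):
    """Rescale any row whose l1-norm exceeds B so it satisfies the bound."""
    projected = []
    for W in layers:
        norms = np.abs(W).sum(axis=1, keepdims=True)
        scale = np.minimum(1.0, B / np.maximum(norms, 1e-12))
        projected.append(W * scale)
    return projected


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    layers = [rng.normal(size=(16, 8)), rng.normal(size=(1, 16))]
    B = 2.0
    layers = project_l1(layers, B)
    print(check_l1_constraint(layers, B))  # True after rescaling
```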
