On the Learnability of Fully-Connected Neural Networks
Martin J. Wainwright | Yuchen Zhang | Michael I. Jordan | Jason D. Lee
[1] G. Pisier. Remarques sur un résultat non publié de B. Maurey [Remarks on an unpublished result of B. Maurey], 1981.
[2] Bernard Widrow, et al. Improving the learning speed of 2-layer neural networks by choosing initial values of the adaptive weights, 1990, IJCNN International Joint Conference on Neural Networks.
[3] Mihalis Yannakakis, et al. Optimization, approximation, and complexity classes, 1988, STOC '88.
[4] M. Talagrand, et al. Probability in Banach Spaces: Isoperimetry and Processes, 1991.
[5] Ronald L. Rivest, et al. Training a 3-node neural network is NP-complete, 1988, COLT '88.
[6] Wolfgang Maass, et al. Agnostic PAC Learning of Functions on Analog Neural Nets, 1993, Neural Computation.
[7] Yoav Freund, et al. A decision-theoretic generalization of on-line learning and an application to boosting, 1995, EuroCOLT.
[8] Peter L. Bartlett, et al. Efficient agnostic learning of neural networks with bounded fan-in, 1996, IEEE Trans. Inf. Theory.
[9] Peter L. Bartlett, et al. The Sample Complexity of Pattern Classification with Neural Networks: The Size of the Weights is More Important than the Size of the Network, 1998, IEEE Trans. Inf. Theory.
[10] Yoshua Bengio, et al. Gradient-based learning applied to document recognition, 1998, Proc. IEEE.
[11] Yoram Singer, et al. Improved Boosting Algorithms Using Confidence-rated Predictions, 1998, COLT '98.
[12] Naftali Tishby, et al. Noise Tolerant Learning Using Early Predictors, 1999.
[13] Yoshua Bengio, et al. Boosting Neural Networks, 2000, Neural Computation.
[14] Peter L. Bartlett, et al. Rademacher and Gaussian Complexities: Risk Bounds and Structural Results, 2003, J. Mach. Learn. Res.
[15] V. Koltchinskii, et al. Empirical margin distributions and bounding the generalization error of combined classifiers, 2002, math/0405343.
[16] Adam Tauman Kalai, et al. Noise-tolerant learning, the parity problem, and the statistical query model, 2000, STOC '00.
[17] Alexander A. Sherstov, et al. Cryptographic Hardness for Learning Intersections of Halfspaces, 2006, FOCS.
[18] Ambuj Tewari, et al. On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization, 2008, NIPS.
[19] Yoram Singer, et al. On the equivalence of weak learnability and linear separability: new relaxations and efficient boosting algorithms, 2010, Machine Learning.
[20] Tara N. Sainath, et al. Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups, 2012, IEEE Signal Processing Magazine.
[21] Geoffrey E. Hinton, et al. ImageNet classification with deep convolutional neural networks, 2012, Commun. ACM.
[22] Günther Palm, et al. Sparse activity and sparse connectivity in supervised learning, 2016, J. Mach. Learn. Res.
[23] Aditya Bhaskara, et al. Provable Bounds for Learning Some Deep Representations, 2013, ICML.
[24] Roi Livni, et al. On the Computational Efficiency of Training Neural Networks, 2014, NIPS.
[25] Ryota Tomioka, et al. Norm-Based Capacity Control in Neural Networks, 2015, COLT.
[26] Anima Anandkumar, et al. Provable Methods for Training Neural Networks with Sparse Connectivity, 2014, ICLR.
[27] Anima Anandkumar, et al. Generalization Bounds for Neural Networks through Tensor Factorization, 2015, ArXiv.
[28] Shane Legg, et al. Human-level control through deep reinforcement learning, 2015, Nature.
[29] Yoshua Bengio, et al. Neural Machine Translation by Jointly Learning to Align and Translate, 2014, ICLR.
[30] Yuchen Zhang, et al. L1-regularized Neural Networks are Improperly Learnable in Polynomial Time, 2015, ICML.