Learning Depth-Three Neural Networks in Polynomial Time

We give a polynomial-time algorithm for learning neural networks with one hidden layer of sigmoids feeding into any Lipschitz, monotone activation function (e.g., sigmoid or ReLU). We make no assumptions on the structure of the network, and the algorithm succeeds with respect to {\em any} distribution on the unit ball in $n$ dimensions (hidden weight vectors also have unit norm). This is the first assumption-free, provably efficient algorithm for learning neural networks with more than one nonlinear layer. Our algorithm, {\em Alphatron}, is a simple, iterative update rule that combines isotonic regression with kernel methods. It outputs a hypothesis that yields efficient oracle access to interpretable features. It also suggests a new approach to Boolean function learning via smooth relaxations of hard thresholds, sidestepping traditional hardness results from computational learning theory. Along these lines, we give improved results for a number of longstanding problems related to Boolean concept learning, unifying a variety of techniques. For example, we give the first polynomial-time, distribution-free algorithm for learning intersections of halfspaces with a margin, and the first generalization of DNF learning to the setting of probabilistic concepts (learnable with queries under the uniform distribution). Finally, we give the first provably correct algorithms for common schemes in multiple-instance learning.
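
To make the update rule concrete, the following is a minimal sketch of an Alphatron-style training loop in Python with NumPy. It assumes a monotone link function u (the sigmoid here), a generic kernel supplied by the caller, and illustrative values for the learning rate and iteration count; the full algorithm in the paper additionally selects the best iterate using held-out data, which is omitted in this sketch.

    import numpy as np

    def sigmoid(z):
        # Standard logistic sigmoid, used here as the monotone link u.
        return 1.0 / (1.0 + np.exp(-z))

    def alphatron(X, y, kernel, u=sigmoid, learning_rate=1.0, iterations=100):
        # X: (m, n) array of unit-norm examples; y: (m,) targets in [0, 1].
        # kernel: function K(x, x') -> float.
        # learning_rate and iterations are illustrative placeholders.
        m = X.shape[0]
        # Gram matrix over the training set.
        K = np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])
        alpha = np.zeros(m)  # one dual coefficient per training example
        for _ in range(iterations):
            preds = u(K @ alpha)                        # current predictions
            alpha += (learning_rate / m) * (y - preds)  # additive, Isotron-style update
        def h(x):
            # Learned hypothesis: kernel expansion passed through the link u.
            return u(sum(a * kernel(xi, x) for a, xi in zip(alpha, X)))
        return h

    # Hypothetical instantiation: degree-3 polynomial kernel with a sigmoid link.
    # h = alphatron(X_train, y_train, kernel=lambda a, b: (1.0 + a @ b) ** 3)

The choice of kernel determines the feature space in which the kernel expansion lives; a polynomial or multinomial kernel of appropriate degree is the kind of instantiation used to approximate a hidden layer of sigmoids.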
