Breaking the Curse of Dimensionality with Convex Neural Networks

We consider neural networks with a single hidden layer and non-decreasing positively homogeneous activation functions, such as rectified linear units. By letting the number of hidden units grow unbounded and using classical non-Euclidean regularization tools on the output weights, we provide a detailed theoretical analysis of their generalization performance, studying both the approximation and the estimation errors. We show in particular that these networks are adaptive to unknown underlying linear structures, such as dependence on the projection of the input variables onto a low-dimensional subspace. Moreover, when sparsity-inducing norms are used on the input weights, we show that high-dimensional non-linear variable selection may be achieved without any strong assumption on the data, and with a total number of variables potentially exponential in the number of observations. In addition, we provide a simple geometric interpretation of the non-convex problem of adding a new unit, which is the core potentially hard computational element in the framework of learning from continuously many basis functions. We give simple conditions under which convex relaxations achieve the same generalization error bounds, even when constant-factor approximations cannot be found (e.g., because the subproblem is NP-hard, as it is for the zero-homogeneous activation function). We were not able to find strong enough convex relaxations, and we leave open the existence or non-existence of polynomial-time algorithms.
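To make the setup concrete: the predictors studied are of the form f(x) = sum_j c_j max(w_j' x, 0), with input weights w_j constrained to the unit sphere and an l1-type penalty sum_j |c_j| on the output weights; letting the number of units grow unbounded makes the problem convex in the (now measure-valued) output weights. The sketch below illustrates the resulting incremental algorithm under loose assumptions: each outer step adds one hidden unit, which is exactly the non-convex subproblem the abstract refers to, and which is only approximated here by scoring random directions. Names such as fit_convex_relu_net and best_new_unit are illustrative, not from the paper.

```python
# A minimal sketch (an assumption-laden illustration, not the paper's
# algorithm): grow a single-hidden-layer ReLU network one unit at a time,
# conditional-gradient style, refitting l1-regularized output weights.
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def best_new_unit(X, residual, n_candidates=500, seed=None):
    """Heuristic for the hard subproblem max_{||w||=1} |<residual, relu(X w)>|,
    approximated here by scoring random directions on the sphere."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n_candidates, X.shape[1]))
    W /= np.linalg.norm(W, axis=1, keepdims=True)   # unit-norm input weights
    scores = np.abs(relu(X @ W.T).T @ residual)     # correlation with residual
    return W[np.argmax(scores)]

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def fit_convex_relu_net(X, y, n_units=20, lam=0.1, inner_steps=200):
    """Greedy unit addition, then ISTA on 0.5*||H c - y||^2 + lam*||c||_1."""
    units, c = [], np.zeros(0)
    for _ in range(n_units):
        pred = relu(X @ np.array(units).T) @ c if units else 0.0
        units.append(best_new_unit(X, y - pred))
        H = relu(X @ np.array(units).T)             # hidden-layer activations
        c = np.append(c, 0.0)
        step = 1.0 / np.linalg.norm(H, 2) ** 2      # 1/L, L = spectral norm^2
        for _ in range(inner_steps):
            grad = H.T @ (H @ c - y)
            c = soft_threshold(c - step * grad, step * lam)
    return np.array(units), c

# Toy usage: y depends only on a one-dimensional projection of the input,
# the kind of hidden linear structure the analysis adapts to.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
W, c = fit_convex_relu_net(X, y, n_units=15, lam=0.01)
print("training MSE:", np.mean((relu(X @ W.T) @ c - y) ** 2))
```

The l1 refit keeps many output weights at zero, so the effective number of units stays small; replacing the random search in best_new_unit with an exact or constant-factor oracle is precisely the open computational question the abstract raises.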
