Identification of Shallow Neural Networks by Fewest Samples

We address the uniform approximation of sums of ridge functions $\sum_{i=1}^m g_i(a_i\cdot x)$ on ${\mathbb R}^d$, representing the shallowest form of feed-forward neural network, from a small number of query samples, under mild smoothness assumptions on the functions $g_i$ and near-orthogonality of the ridge directions $a_i$. The sample points are randomly generated and are universal, in the sense that queries at those points allow the proposed recovery algorithms to uniformly approximate any sum of ridge functions with high probability. Our general approximation strategy is developed as a sequence of algorithms performing individual sub-tasks. We first approximate the span of the ridge directions. Then a straightforward substitution reduces the dimensionality of the problem from $d$ to $m$. The core of the construction is the approximation of the ridge directions, expressed in terms of the rank-$1$ matrices $a_i \otimes a_i$, realized by formulating their individual identification as a suitable nonlinear program that maximizes the spectral norm of competitors constrained to the unit Frobenius sphere. The final step approximates the functions $g_1,\dots,g_m$ by $\hat g_1,\dots,\hat g_m$. Higher-order differentiation of sums of ridge functions, or of their compositions as in deeper neural networks, yields a natural connection between neural network weight identification and tensor decomposition. In the case of the shallowest feed-forward neural network, we show that second-order differentiation and tensors of order two (i.e., matrices) suffice.
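For concreteness, the following is a minimal numerical sketch of the pipeline described above, under the simplifying assumptions of exact point queries and near-orthogonal directions. It relies on the identity $\nabla^2 f(x) = \sum_{i=1}^m g_i''(a_i\cdot x)\, a_i a_i^T$ for $f(x) = \sum_{i=1}^m g_i(a_i\cdot x)$, so finite-difference Hessians at random points approximately span the space spanned by the matrices $a_i \otimes a_i$; the sketch then searches that matrix subspace for spectral-norm maximizers on the unit Frobenius sphere by a simple power-type ascent. All function names are illustrative and the ascent step is a generic stand-in, not the paper's algorithm.

```python
import numpy as np

def estimate_hessian(f, x, h=1e-3):
    """Central finite-difference Hessian of the query function f at x.
    For f(x) = sum_i g_i(a_i . x) this approximates sum_i g_i''(a_i . x) a_i a_i^T."""
    d = x.size
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei, ej = np.eye(d)[i] * h, np.eye(d)[j] * h
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * h * h)
    return H

def hessian_span(f, d, m, n_samples=50, rng=None):
    """Orthonormal basis (as d x d matrices) of the span of sampled Hessians,
    which ideally equals span{a_i a_i^T : i = 1..m}."""
    rng = np.random.default_rng(rng)
    vecs = []
    for _ in range(n_samples):
        x = rng.standard_normal(d) / np.sqrt(d)      # random query point
        vecs.append(estimate_hessian(f, x).ravel())
    # Leading left singular vectors of the stacked, vectorized Hessians span the subspace.
    U, s, _ = np.linalg.svd(np.stack(vecs, axis=1), full_matrices=False)
    return [U[:, k].reshape(d, d) for k in range(m)]

def find_rank_one(basis, n_iter=200, rng=None):
    """Power-type ascent for:  max ||M||_2  s.t.  M in span(basis), ||M||_F = 1.
    Near-rank-one maximizers are candidates for the matrices a_i a_i^T."""
    rng = np.random.default_rng(rng)
    d = basis[0].shape[0]
    B = np.stack([b.ravel() for b in basis])         # m x d^2, orthonormal rows
    c = rng.standard_normal(len(basis))
    c /= np.linalg.norm(c)
    for _ in range(n_iter):
        M = (c @ B).reshape(d, d)
        M = 0.5 * (M + M.T)                          # symmetrize (finite-difference noise)
        w, V = np.linalg.eigh(M)
        k = np.argmax(np.abs(w))                     # top eigenpair drives the gradient of ||M||_2
        G = np.sign(w[k]) * np.outer(V[:, k], V[:, k])
        c = B @ G.ravel()                            # project the gradient onto the subspace
        c /= np.linalg.norm(c)                       # stay on the unit Frobenius sphere
    M = (c @ B).reshape(d, d)
    w, V = np.linalg.eigh(0.5 * (M + M.T))
    return V[:, np.argmax(np.abs(w))]                # unit vector close to +-a_i

# Toy usage: two orthogonal ridge directions in dimension d = 6.
d, m = 6, 2
a1, a2 = np.eye(d)[0], np.eye(d)[1]
f = lambda x: np.tanh(a1 @ x) + (a2 @ x) ** 2
basis = hessian_span(f, d, m)
a_hat = find_rank_one(basis)                         # close to +-a1 or +-a2
```

Once the directions are recovered, each $\hat g_i$ can be fitted by one-dimensional approximation of the queries along the corresponding direction; that last step is omitted here since it is standard univariate regression.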
