Identification of Shallow Neural Networks by Fewest Samples

We address the uniform approximation of sums of ridge functions $\sum_{i=1}^m g_i(a_i\cdot x)$ on ${\mathbb R}^d$, representing the shallowest form of feed-forward neural network, from a small number of query samples, under mild smoothness assumptions on the functions $g_i$ and near-orthogonality of the ridge directions $a_i$. The sample points are randomly generated and are universal, in the sense that queries at those points allow the proposed recovery algorithms to uniformly approximate any sum of ridge functions with high probability. Our general approximation strategy is developed as a sequence of algorithms performing individual sub-tasks. We first approximate the span of the ridge directions. Then a straightforward substitution reduces the dimensionality of the problem from $d$ to $m$. The core of the construction is the approximation of the ridge directions, expressed in terms of the rank-$1$ matrices $a_i \otimes a_i$, realized by formulating their individual identification as a suitable nonlinear program that maximizes the spectral norm of competitors constrained to the unit Frobenius sphere. The final step approximates the functions $g_1,\dots,g_m$ by $\hat g_1,\dots,\hat g_m$. Higher-order differentiation of sums of ridge functions, or of their compositions as in deeper neural networks, yields a natural connection between neural network weight identification and tensor decomposition. In the case of the shallowest feed-forward neural network, we show that second-order differentiation and tensors of order two (i.e., matrices) suffice.
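For concreteness, the following is a minimal numerical sketch of the pipeline described above, under the simplifying assumptions of exact point queries and near-orthogonal directions. It relies on the identity $\nabla^2 f(x) = \sum_{i=1}^m g_i''(a_i\cdot x)\, a_i a_i^T$ for $f(x) = \sum_{i=1}^m g_i(a_i\cdot x)$, so finite-difference Hessians at random points approximately span the space spanned by the matrices $a_i \otimes a_i$; the sketch then searches that matrix subspace for spectral-norm maximizers on the unit Frobenius sphere by a simple power-type ascent. All function names are illustrative and the ascent step is a generic stand-in, not the paper's algorithm.

```python
import numpy as np

def estimate_hessian(f, x, h=1e-3):
    """Central finite-difference Hessian of the query function f at x.
    For f(x) = sum_i g_i(a_i . x) this approximates sum_i g_i''(a_i . x) a_i a_i^T."""
    d = x.size
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei, ej = np.eye(d)[i] * h, np.eye(d)[j] * h
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * h * h)
    return H

def hessian_span(f, d, m, n_samples=50, rng=None):
    """Orthonormal basis (as d x d matrices) of the span of sampled Hessians,
    which ideally equals span{a_i a_i^T : i = 1..m}."""
    rng = np.random.default_rng(rng)
    vecs = []
    for _ in range(n_samples):
        x = rng.standard_normal(d) / np.sqrt(d)      # random query point
        vecs.append(estimate_hessian(f, x).ravel())
    # Leading left singular vectors of the stacked, vectorized Hessians span the subspace.
    U, s, _ = np.linalg.svd(np.stack(vecs, axis=1), full_matrices=False)
    return [U[:, k].reshape(d, d) for k in range(m)]

def find_rank_one(basis, n_iter=200, rng=None):
    """Power-type ascent for:  max ||M||_2  s.t.  M in span(basis), ||M||_F = 1.
    Near-rank-one maximizers are candidates for the matrices a_i a_i^T."""
    rng = np.random.default_rng(rng)
    d = basis[0].shape[0]
    B = np.stack([b.ravel() for b in basis])         # m x d^2, orthonormal rows
    c = rng.standard_normal(len(basis))
    c /= np.linalg.norm(c)
    for _ in range(n_iter):
        M = (c @ B).reshape(d, d)
        M = 0.5 * (M + M.T)                          # symmetrize (finite-difference noise)
        w, V = np.linalg.eigh(M)
        k = np.argmax(np.abs(w))                     # top eigenpair drives the gradient of ||M||_2
        G = np.sign(w[k]) * np.outer(V[:, k], V[:, k])
        c = B @ G.ravel()                            # project the gradient onto the subspace
        c /= np.linalg.norm(c)                       # stay on the unit Frobenius sphere
    M = (c @ B).reshape(d, d)
    w, V = np.linalg.eigh(0.5 * (M + M.T))
    return V[:, np.argmax(np.abs(w))]                # unit vector close to +-a_i

# Toy usage: two orthogonal ridge directions in dimension d = 6.
d, m = 6, 2
a1, a2 = np.eye(d)[0], np.eye(d)[1]
f = lambda x: np.tanh(a1 @ x) + (a2 @ x) ** 2
basis = hessian_span(f, d, m)
a_hat = find_rank_one(basis)                         # close to +-a1 or +-a2
```

Once the directions are recovered, each $\hat g_i$ can be fitted by one-dimensional approximation of the queries along the corresponding direction; that last step is omitted here since it is standard univariate regression.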
