Robust and Resource-Efficient Identification of Two Hidden Layer Neural Networks

We address the structure identification and the uniform approximation of neural networks with two fully nonlinear layers, of the type $f(x)=1^T h(B^T g(A^T x))$ on $\mathbb R^d$, from a small number of query samples. We approach the problem by actively sampling finite difference approximations to Hessians of the network. Gathering several approximate Hessians allows us to reliably approximate the matrix subspace $\mathcal W$ spanned by the symmetric tensors $a_1 \otimes a_1 ,\dots,a_{m_0}\otimes a_{m_0}$, formed by the weights of the first layer, together with the entangled symmetric tensors $v_1 \otimes v_1 ,\dots,v_{m_1}\otimes v_{m_1}$, formed by suitable combinations of the weights of the first and second layers as $v_\ell=A G_0 b_\ell/\|A G_0 b_\ell\|_2$, $\ell \in [m_1]$, for a diagonal matrix $G_0$ depending on the activation functions of the first layer. The rank-1 symmetric tensors within $\mathcal W$ are then identified by solving a robust nonlinear program. We provide guarantees of stable recovery under a posteriori verifiable conditions. We further address the correct attribution of the approximate weights to the first or second layer. By using a suitably adapted gradient descent iteration, it is then possible to estimate, up to intrinsic symmetries, the shifts of the activation functions of the first layer and to compute exactly the matrix $G_0$. Our method of identifying the weights of the network is fully constructive, with quantifiable sample complexity, and therefore contributes to reducing the black-box nature of the network training phase. We corroborate our theoretical results with extensive numerical experiments.
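The structural fact behind the Hessian sampling step can be illustrated with a small NumPy sketch. It is not the authors' implementation: the network sizes, the `tanh` activations, the shift vectors `theta` and `tau`, and the helper `fd_hessian` are all assumptions made for illustration. Under these assumptions $G_0=\mathrm{diag}(g'(\theta))$, and by the chain rule the Hessian of $f$ at the origin is a linear combination of the tensors $a_i\otimes a_i$ and $v_\ell\otimes v_\ell$, so a finite difference Hessian at $x=0$ should lie in $\mathcal W$ up to discretization error.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m0, m1 = 6, 4, 3  # input dimension and layer widths (illustrative choices)

# Hypothetical test network with unit-norm weight columns and shifted tanh units.
A = rng.standard_normal((d, m0))
A /= np.linalg.norm(A, axis=0)
B = rng.standard_normal((m0, m1))
B /= np.linalg.norm(B, axis=0)
theta = rng.uniform(0.3, 1.0, m0)  # first-layer shifts (assumed)
tau = rng.uniform(0.3, 1.0, m1)    # second-layer shifts (assumed)


def f(x):
    """f(x) = 1^T h(B^T g(A^T x)) with g = tanh(. + theta), h = tanh(. + tau)."""
    return np.sum(np.tanh(B.T @ np.tanh(A.T @ x + theta) + tau))


def fd_hessian(func, x, eps=1e-4):
    """Central finite difference approximation of the Hessian, using only queries of func."""
    n = x.size
    E = eps * np.eye(n)
    H = np.empty((n, n))
    for i in range(n):
        for j in range(i, n):
            H[i, j] = H[j, i] = (
                func(x + E[i] + E[j]) - func(x + E[i] - E[j])
                - func(x - E[i] + E[j]) + func(x - E[i] - E[j])
            ) / (4 * eps**2)
    return H


H0 = fd_hessian(f, np.zeros(d))

# G_0 = diag(g'(theta)) for the assumed activation; v_l = A G_0 b_l / ||A G_0 b_l||_2.
G0 = np.diag(1.0 - np.tanh(theta) ** 2)
V = A @ G0 @ B
V /= np.linalg.norm(V, axis=0)

# Columns: vectorized a_i (x) a_i and v_l (x) v_l, which span W.
basis = np.column_stack(
    [np.outer(a, a).ravel() for a in A.T] + [np.outer(v, v).ravel() for v in V.T]
)

# Distance of the approximate Hessian from W (least-squares residual).
coef, *_ = np.linalg.lstsq(basis, H0.ravel(), rcond=None)
residual = np.linalg.norm(basis @ coef - H0.ravel())
print("distance of Hessian at 0 from W:", residual)
```

Away from the origin the diagonal factor becomes $\mathrm{diag}(g'(A^T x+\theta))$ rather than $G_0$, so in this sketch Hessians sampled at nonzero points are only approximately in $\mathcal W$, which is why the abstract speaks of approximating the subspace.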
