Stable Recovery of Entangled Weights: Towards Robust Identification of Deep Neural Networks from Minimal Samples

In this paper, we address the problem of unique and stable identifiability of generic deep artificial neural networks with pyramidal shape and smooth activation functions from a finite number of input-output samples. More specifically, we introduce the so-called entangled weights, which compose the weights of successive layers intertwined with suitable diagonal and invertible matrices that depend on the activation functions and their shifts. We prove that entangled weights are completely and stably approximated by an efficient and robust algorithm as soon as O(D × m) nonadaptive input-output samples of the network are collected, where D is the input dimension and m is the number of neurons of the network. Moreover, we empirically observe that the approach applies to networks with up to O(D × m_L) neurons, where m_L is the number of output neurons at layer L. Provided the layer assignments of the entangled weights and the remaining scaling and shift parameters are known (the latter may be obtained heuristically by least squares), the entangled weights identify the network completely and uniquely. To highlight the relevance of the theoretical result on the stable recovery of entangled weights, we present numerical experiments demonstrating that multilayer networks with generic weights can be robustly identified, and therefore uniformly approximated, by the presented algorithmic pipeline. In contrast, backpropagation does not generalize stably in this setting and remains limited by a relatively large uniform error. In terms of practical impact, our study shows that input-output information can be related uniquely and stably to network parameters, providing a form of explainability. Moreover, our method paves the way for the compression of overparametrized networks and for the training of minimal-complexity networks.
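To make the notion of entangled weights concrete, the following minimal NumPy sketch builds them for a toy three-layer pyramidal network with tanh activations. The specific choice of the diagonal matrices as activation derivatives evaluated at the shifts is an assumption made for illustration only; the abstract only states that they are suitable diagonal, invertible matrices depending on the activation functions and their shifts.

```python
# Illustrative sketch (not the paper's exact construction): entangled weights
# for a toy pyramidal network f(x) = W3 g(W2 g(W1 x + b1) + b2).
import numpy as np

def tanh_prime(t):
    return 1.0 - np.tanh(t) ** 2

rng = np.random.default_rng(0)
D, m1, m2, mL = 8, 6, 4, 2            # pyramidal widths: D >= m1 >= m2 >= mL
W1, b1 = rng.standard_normal((m1, D)), rng.standard_normal(m1)
W2, b2 = rng.standard_normal((m2, m1)), rng.standard_normal(m2)
W3 = rng.standard_normal((mL, m2))

# Diagonal, invertible matrices depending on the activation and its shifts.
# Taking derivatives of tanh at the shifts b_l is an assumed choice here.
G1 = np.diag(tanh_prime(b1))
G2 = np.diag(tanh_prime(b2))

V1 = W1                                # layer 1: entangled = original weights
V2 = W2 @ G1 @ W1                      # layer 2 entangled weights (m2 x D)
V3 = W3 @ G2 @ W2 @ G1 @ W1            # layer 3 entangled weights (mL x D)

# Each row of V2 and V3 is an entangled weight vector in input space R^D;
# the recovery result concerns approximating such rows (up to sign and scale)
# from O(D x m) samples, with m = m1 + m2 + mL the total number of neurons.
print(V1.shape, V2.shape, V3.shape)
```

Note that all entangled weights live in the same input space R^D, regardless of their layer, which is what allows them to be recovered jointly before layer assignments, scales, and shifts are resolved.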
