Implicit Regularization in Tensor Factorization

Recent efforts to unravel the mystery of implicit regularization in deep learning have led to a theoretical focus on matrix factorization, i.e., matrix completion via a linear neural network. As a step further towards practical deep learning, we provide the first theoretical analysis of implicit regularization in tensor factorization, i.e., tensor completion via a certain type of non-linear neural network. We circumvent the notorious difficulty of tensor problems by adopting a dynamical systems perspective and characterizing the evolution induced by gradient descent. The characterization suggests a form of greedy low tensor rank search, which we rigorously prove under certain conditions and empirically demonstrate under others. Motivated by tensor rank capturing the implicit regularization of a non-linear neural network, we empirically explore it as a measure of complexity, and find that it captures the essence of datasets on which neural networks generalize. This leads us to believe that tensor rank may pave the way to explaining both implicit regularization in deep learning, and the properties of real-world data that translate this implicit regularization into generalization.
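The setting described above can be illustrated with a minimal sketch: completing an order-3 tensor from a subset of its entries by fitting a rank-R CP factorization with plain gradient descent. The snippet below is not the paper's code; the dimensions, rank, observation fraction, learning rate, and near-zero initialization scale are all hypothetical choices for illustration. Tracking the per-component norms during training is one simple way to observe the greedy low tensor rank search suggested by the analysis.

```python
import torch

# Illustrative sketch only (not the paper's code): tensor completion of an
# order-3 tensor via a rank-R CP factorization trained with gradient descent.
# Dimensions, rank, learning rate, and initialization scale are hypothetical.
torch.manual_seed(0)
d1, d2, d3, R = 10, 10, 10, 5

# Rank-2 ground-truth tensor and a random observation mask (~30% of entries).
gt = torch.einsum('ir,jr,kr->ijk', *[torch.randn(d, 2) for d in (d1, d2, d3)])
mask = torch.rand(d1, d2, d3) < 0.3

# CP factor matrices with near-zero initialization, the regime in which the
# implicit bias towards low tensor rank is typically studied.
A = (0.01 * torch.randn(d1, R)).requires_grad_()
B = (0.01 * torch.randn(d2, R)).requires_grad_()
C = (0.01 * torch.randn(d3, R)).requires_grad_()
opt = torch.optim.SGD([A, B, C], lr=0.2)

for step in range(20001):
    pred = torch.einsum('ir,jr,kr->ijk', A, B, C)   # tensor realized by the factorization
    loss = ((pred - gt)[mask] ** 2).mean()          # fit observed entries only
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 5000 == 0:
        # Norm of each CP component; under greedy low tensor rank search,
        # most components are expected to remain close to zero.
        norms = (A.norm(dim=0) * B.norm(dim=0) * C.norm(dim=0)).detach()
        print(f"step {step:5d}  loss {loss.item():.5f}  component norms {norms.tolist()}")
```

In this sketch the end-to-end tensor is the non-linear (multilinear) model whose implicit regularization is being probed; the small initialization is what makes the components grow one at a time rather than all at once.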
