A Novel Stochastic Gradient Descent Algorithm for Learning Principal Subspaces

Many machine learning problems encode their data as a matrix with a possibly very large number of rows and columns. In applications such as neuroscience, image compression, and deep reinforcement learning, the principal subspace of such a matrix provides a useful, low-dimensional representation of individual data points. Here, we are interested in determining the d-dimensional principal subspace of a given matrix from sample entries, i.e., from small random submatrices. Although a number of sample-based methods exist for this problem (e.g. Oja's rule (Oja, 1982)), these assume access to full columns of the matrix or to particular matrix structure such as symmetry, and cannot be combined as-is with neural networks (Baldi and Hornik, 1989). In this paper, we derive an algorithm that learns a principal subspace from sample entries, can be applied when the approximate subspace is represented by a neural network, and hence can be scaled to datasets with an effectively infinite number of rows and columns. Our method defines a loss function whose minimizer is the desired principal subspace and constructs a gradient estimate of this loss whose bias can be controlled. We complement our theoretical analysis with a series of experiments on synthetic matrices, the MNIST dataset (LeCun, 1998), and the reinforcement learning domain PuddleWorld (Sutton, 1995), demonstrating the usefulness of our approach.
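For concreteness, the target object can be computed exactly on a small matrix. The sketch below is not the algorithm of the paper: it uses a full singular value decomposition, which needs the whole matrix, as a reference solution against which a sample-based estimate could be compared. The function names `principal_subspace` and `subspace_distance`, the synthetic matrix, and the choice of the left singular vectors (the column space) are assumptions made for illustration.

```python
import numpy as np


def principal_subspace(M, d):
    """Return an orthonormal basis (n x d) for the span of the top-d
    left singular vectors of M, computed with a full SVD."""
    U, _, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, :d]


def subspace_distance(A, B):
    """Frobenius distance between the orthogonal projectors onto
    span(A) and span(B). It is zero when the two spans coincide,
    regardless of which basis represents each span."""
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    return np.linalg.norm(Qa @ Qa.T - Qb @ Qb.T)


# Hypothetical synthetic data: a rank-d matrix plus small noise.
rng = np.random.default_rng(0)
n, m, d = 50, 40, 3
L = rng.normal(size=(n, d))
R = rng.normal(size=(d, m))
M = L @ R + 0.01 * rng.normal(size=(n, m))

U_d = principal_subspace(M, d)

# The recovered subspace should nearly coincide with span(L).
dist = subspace_distance(U_d, L)
print(f"distance to planted subspace: {dist:.4f}")
```

Comparing projectors rather than bases matters because a subspace has many bases: any invertible d x d mixing of the columns of `U_d` spans the same subspace and should count as the same answer.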

[1]  Jun Zhu,et al.  NeuralEF: Deconstructing Kernels by Deep Neural Networks , 2022, ICML.

[2]  T. Graepel,et al.  EigenGame Unloaded: When playing games is better than optimizing , 2021, ICLR.

[3]  L. Balzano.  On the equivalence of Oja's algorithm and GROUSE , 2022, AISTATS.

[4]  Simon S. Du,et al.  Global Convergence of Gradient Descent for Asymmetric Low-Rank Matrix Factorization , 2021, NeurIPS.

[5]  Clare Lyle,et al.  On The Effect of Auxiliary Tasks on Representation Dynamics , 2021, AISTATS.

[6]  Yann Ollivier,et al.  Learning Successor States and Goal-Dependent Values: A Mathematical Viewpoint , 2021, ArXiv.

[7]  Thore Graepel,et al.  EigenGame: PCA as a Nash Equilibrium , 2020, ICLR.

[8]  Marc G. Bellemare,et al.  The Value-Improvement Path: Towards Better Representations for Reinforcement Learning , 2020, AAAI.

[9]  Manfred K. Warmuth,et al.  An Implicit Form of Krasulina's k-PCA Update without the Orthonormality Constraint , 2019, AAAI.

[10]  Natalia Gimelshein,et al.  PyTorch: An Imperative Style, High-Performance Deep Learning Library , 2019, NeurIPS.

[11]  Marc G. Bellemare,et al.  DeepMDP: Learning Continuous Latent Space Models for Representation Learning , 2019, ICML.

[12]  Cheng Tang,et al.  Exponentially convergent stochastic k-PCA without variance reduction , 2019, NeurIPS.

[13]  David Pfau,et al.  Spectral Inference Networks: Unifying Deep and Spectral Learning , 2018, ICLR.

[14]  Marlos C. Machado,et al.  Eigenoption Discovery through the Deep Successor Representation , 2017, ICLR.

[15]  Junwei Lu,et al.  Symmetry, Saddle Points, and Global Optimization Landscape of Nonconvex Matrix Factorization , 2016, 2018 Information Theory and Applications Workshop (ITA).

[16]  Marek Petrik,et al.  Low-rank Feature Selection for Reinforcement Learning , 2018, ISAIM.

[17]  Yuan Yu,et al.  TensorFlow: A system for large-scale machine learning , 2016, OSDI.

[18]  Sham M. Kakade,et al.  Provable Efficient Online Matrix Completion via Non-convex Stochastic Gradient Descent , 2016, NIPS.

[19]  Naman Agarwal,et al.  Second Order Stochastic Optimization in Linear Time , 2016, ArXiv.

[20]  Furong Huang,et al.  Escaping From Saddle Points - Online Stochastic Gradient for Tensor Decomposition , 2015, COLT.

[21]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[22]  Zhi-Quan Luo,et al.  Guaranteed Matrix Completion via Non-Convex Factorization , 2014, IEEE Transactions on Information Theory.

[23]  Christopher De Sa,et al.  Global Convergence of Stochastic Gradient Descent for Some Non-convex Matrix Problems , 2014, ICML.

[24]  Moritz Hardt,et al.  Understanding Alternating Minimization for Matrix Completion , 2013, 2014 IEEE 55th Annual Symposium on Foundations of Computer Science.

[25]  Prateek Jain,et al.  Low-rank matrix completion using alternating minimization , 2012, STOC '13.

[26]  Gaël Varoquaux,et al.  The NumPy Array: A Structure for Efficient Numerical Computation , 2011, Computing in Science & Engineering.

[27]  Robert D. Nowak,et al.  Online identification and tracking of subspaces from highly incomplete information , 2010, 2010 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton).

[28]  Olgica Milenkovic,et al.  SET: An algorithm for consistent matrix completion , 2009, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

[29]  Andrea Montanari,et al.  Matrix completion from a few entries , 2009, 2009 IEEE International Symposium on Information Theory.

[30]  Sewoong Oh,et al.  A Gradient Descent Algorithm on the Grassman Manifold for Matrix Completion , 2009, ArXiv.

[31]  Sridhar Mahadevan,et al.  Proto-value Functions: A Laplacian Framework for Learning Representation and Control in Markov Decision Processes , 2007, J. Mach. Learn. Res.

[32]  John D. Hunter,et al.  Matplotlib: A 2D Graphics Environment , 2007, Computing in Science & Engineering.

[33]  Travis E. Oliphant,et al.  Python for Scientific Computing , 2007, Computing in Science & Engineering.

[34]  Yann LeCun,et al.  The MNIST database of handwritten digits , 2005 .

[35]  Tommi S. Jaakkola,et al.  Weighted Low-Rank Approximations , 2003, ICML.

[36]  Terrence J. Sejnowski,et al.  Slow Feature Analysis: Unsupervised Learning of Invariances , 2002, Neural Computation.

[37]  Eric Jones,et al.  SciPy: Open Source Scientific Tools for Python , 2001 .

[38]  Guido Rossum,et al.  Python Reference Manual , 2000 .

[39]  Richard S. Sutton,et al.  Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding , 1996 .

[40]  Peter Dayan,et al.  Improving Generalization for Temporal Difference Learning: The Successor Representation , 1993, Neural Computation.

[41]  Kurt Hornik,et al.  Neural networks and principal component analysis: Learning from examples without local minima , 1989, Neural Networks.

[42]  E. Oja.  Simplified neuron model as a principal component analyzer , 1982, Journal of mathematical biology.

[43]  J. Danskin.  The Theory of Max-Min and its Application to Weapons Allocation Problems , 1967 .