Tensor decompositions for learning latent variable models

This work considers a computationally and statistically efficient parameter estimation method for a wide class of latent variable models (including Gaussian mixture models, hidden Markov models, and latent Dirichlet allocation) that exploits a certain tensor structure in their low-order observable moments, typically of second and third order. Specifically, parameter estimation is reduced to the problem of extracting a certain (orthogonal) decomposition of a symmetric tensor derived from the moments; this decomposition can be viewed as a natural generalization of the singular value decomposition for matrices. Although tensor decompositions are generally intractable to compute, the decomposition of these specially structured tensors can be obtained efficiently by a variety of approaches, including power iterations and maximization approaches, similar to the case of matrices. A detailed analysis of a robust tensor power method is provided, establishing an analogue of Wedin's perturbation theorem for the singular vectors of matrices. This implies a robust and computationally tractable estimation approach for several popular latent variable models.
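To make the power-iteration idea concrete, the following is a minimal sketch of power iteration with deflation on a symmetric third-order tensor that has an exact orthogonal decomposition T = sum_i w_i v_i ⊗ v_i ⊗ v_i with positive weights. It is an illustration under stated assumptions, not the robust method analyzed in the paper: the function names (`tensor_apply`, `power_deflation`), the parameters (`n_restarts`, `n_iters`, `seed`), and the synthetic noise-free test tensor are all hypothetical choices made for this example.

```python
import numpy as np


def tensor_apply(T, v):
    """Contract the last two modes of T with v, i.e. compute T(I, v, v)."""
    return np.einsum("ijk,j,k->i", T, v, v)


def power_deflation(T, k, n_restarts=10, n_iters=100, seed=0):
    """Recover k (weight, vector) pairs from a symmetric third-order tensor.

    Assumes T is (close to) sum_i w_i * v_i (x) v_i (x) v_i with
    orthonormal v_i and positive w_i. Parameter names are illustrative.
    """
    rng = np.random.default_rng(seed)
    T = T.copy()  # deflation modifies the tensor, so work on a copy
    d = T.shape[0]
    weights, vectors = [], []
    for _ in range(k):
        best_lam, best_v = -np.inf, None
        for _ in range(n_restarts):
            # Random unit-norm starting point.
            v = rng.standard_normal(d)
            v /= np.linalg.norm(v)
            # Power iteration: v <- T(I, v, v) / ||T(I, v, v)||.
            for _ in range(n_iters):
                v = tensor_apply(T, v)
                v /= np.linalg.norm(v)
            # Candidate eigenvalue T(v, v, v).
            lam = np.einsum("ijk,i,j,k->", T, v, v, v)
            if lam > best_lam:
                best_lam, best_v = lam, v
        weights.append(best_lam)
        vectors.append(best_v)
        # Deflate: subtract the recovered rank-one component.
        T -= best_lam * np.einsum("i,j,k->ijk", best_v, best_v, best_v)
    return np.array(weights), np.array(vectors)


# Synthetic example (assumed setup): 3 orthonormal components in R^5.
rng = np.random.default_rng(42)
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))
true_vectors = Q[:, :3].T                  # rows are orthonormal
true_weights = np.array([3.0, 2.0, 1.0])
T = np.einsum("r,ri,rj,rk->ijk", true_weights,
              true_vectors, true_vectors, true_vectors)

weights, vectors = power_deflation(T, k=3)
```

In this noise-free setting each run of the inner loop converges to one of the components, and keeping the restart with the largest value of T(v, v, v) before deflating yields the full set of weights and vectors. With empirical moments the tensor is only approximately orthogonally decomposable, which is the case the perturbation analysis in the paper addresses.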

[1]  K. Pearson Contributions to the Mathematical Theory of Evolution , 1894 .

[2]  F. L. Hitchcock The Expression of a Tensor or a Polyadic as a Sum of Products , 1927 .

[3]  F. L. Hitchcock Multiple Invariants and Generalized Rank of a P‐Way Matrix or Tensor , 1928 .

[4]  R. Cattell “Parallel proportional profiles” and other principles for determining the choice of factors by rotation , 1944 .

[5]  Marcel Paul Schützenberger,et al.  On the Definition of a Family of Automata , 1961, Inf. Control..

[6]  J. MacQueen Some methods for classification and analysis of multivariate observations , 1967 .

[7]  Richard A. Harshman,et al.  Foundations of the PARAFAC procedure: Models and conditions for an "explanatory" multi-model factor analysis , 1970 .

[8]  P. Wedin Perturbation bounds in connection with singular value decomposition , 1972 .

[9]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[10]  J. Kruskal Three-way arrays: rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics , 1977 .

[11]  R. Redner,et al.  Mixture densities, maximum likelihood, and the EM algorithm , 1984 .

[12]  L. L. Cam,et al.  Asymptotic methods in statistical theory , 1986 .

[13]  L. L. Cam,et al.  Asymptotic Methods In Statistical Decision Theory , 1986 .

[14]  P. McCullagh Tensor Methods in Statistics , 1987 .

[15]  Jean-Francois Cardoso,et al.  Super-symmetric decomposition of the fourth-order cumulant tensor. Blind identification of more sources than sensors , 1991, [Proceedings] ICASSP 91: 1991 International Conference on Acoustics, Speech, and Signal Processing.

[16]  A. Bunse-Gerstner,et al.  Numerical Methods for Simultaneous Diagonalization , 1993, SIAM J. Matrix Anal. Appl..

[17]  J. Cardoso,et al.  Blind beamforming for non-gaussian signals , 1993 .

[18]  S. Leurgans,et al.  A Decomposition for Three-Way Arrays , 1993, SIAM J. Matrix Anal. Appl..

[19]  Jean-Francois Cardoso,et al.  Perturbation of joint diagonalizers , 1994 .

[20]  Pierre Comon,et al.  Independent component analysis, A new concept? , 1994, Signal Process..

[21]  Nathalie Delfosse,et al.  Adaptive blind separation of independent sources: A deflation approach , 1995, Signal Process..

[22]  B. Moor,et al.  Subspace identification for linear systems , 1996 .

[23]  Pierre Comon,et al.  Independent component analysis, a survey of some algebraic methods , 1996, 1996 IEEE International Symposium on Circuits and Systems. Circuits and Systems Connecting the World. ISCAS 96.

[24]  Joseph T. Chang,et al.  Full reconstruction of Markov models on evolutionary trees: identifiability and consistency. , 1996, Mathematical biosciences.

[25]  Alan M. Frieze,et al.  Learning linear transformations , 1996, Proceedings of 37th Conference on Foundations of Computer Science.

[26]  Robert M. Corless,et al.  A reordered Schur factorization method for zero-dimensional polynomial systems with multiple roots , 1997, ISSAC.

[27]  Stephen J. Wright,et al.  Numerical Optimization , 2018, Fundamental Statistical Inference.

[28]  Aapo Hyvärinen,et al.  Fast and robust fixed-point algorithms for independent component analysis , 1999, IEEE Trans. Neural Networks.

[29]  Sanjoy Dasgupta,et al.  Learning mixtures of Gaussians , 1999, 40th Annual Symposium on Foundations of Computer Science (Cat. No.99CB37039).

[30]  Erkki Oja,et al.  Independent component analysis: algorithms and applications , 2000, Neural Networks.

[31]  Herbert Jaeger,et al.  Observable Operator Models for Discrete Stochastic Time Series , 2000, Neural Computation.

[32]  Joos Vandewalle,et al.  On the Best Rank-1 and Rank-(R1 , R2, ... , RN) Approximation of Higher-Order Tensors , 2000, SIAM J. Matrix Anal. Appl..

[33]  Sanjeev Arora,et al.  Learning mixtures of arbitrary gaussians , 2001, STOC '01.

[34]  Gene H. Golub,et al.  Rank-One Approximation to High Order Tensors , 2001, SIAM J. Matrix Anal. Appl..

[35]  Richard S. Sutton,et al.  Predictive Representations of State , 2001, NIPS.

[36]  Santosh S. Vempala,et al.  A spectral algorithm for learning mixtures of distributions , 2002, The 43rd Annual IEEE Symposium on Foundations of Computer Science, 2002. Proceedings..

[37]  Phillip A. Regalia,et al.  On the Best Rank-1 Approximation of Higher-Order Supersymmetric Tensors , 2001, SIAM J. Matrix Anal. Appl..

[38]  Phillip A. Regalia,et al.  Monotonic convergence of fixed-point algorithms for ICA , 2003, IEEE Trans. Neural Networks.

[39]  Santosh S. Vempala,et al.  A spectral algorithm for learning mixture models , 2004, J. Comput. Syst. Sci..

[40]  L. Lathauwer,et al.  On the Best Rank-1 and Rank-( , 2004 .

[41]  Andreas Ziehe,et al.  A Fast Algorithm for Joint Diagonalization with Non-orthogonal Transformations and its Application to Blind Source Separation , 2004, J. Mach. Learn. Res..

[42]  Sanjeev Arora,et al.  LEARNING MIXTURES OF SEPARATED NONSPHERICAL GAUSSIANS , 2005, math/0503457.

[43]  Dimitris Achlioptas,et al.  On Spectral Learning of Mixtures of Distributions , 2005, COLT.

[44]  Elchanan Mossel,et al.  Learning nonsingular phylogenies and hidden Markov models , 2005, STOC '05.

[45]  Lek-Heng Lim,et al.  Singular values and eigenvalues of tensors: a variational approach , 2005, 1st IEEE International Workshop on Computational Advances in Multi-Sensor Adaptive Processing, 2005..

[46]  M. Drton,et al.  Algebraic factor analysis: tetrads, pentads and beyond , 2005, math/0509390.

[47]  Liqun Qi,et al.  Eigenvalues of a real supersymmetric tensor , 2005, J. Symb. Comput..

[48]  L. Pachter,et al.  Algebraic Statistics for Computational Biology: Preface , 2005 .

[49]  Sébastien Roch,et al.  A short proof that phylogenetic tree reconstruction by maximum likelihood is hard , 2005, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[50]  Lieven De Lathauwer,et al.  Fourth-Order Cumulant-Based Blind Identification of Underdetermined Mixtures , 2007, IEEE Transactions on Signal Processing.

[51]  Sanjoy Dasgupta,et al.  A Probabilistic Analysis of EM for Mixtures of Separated, Spherical Gaussians , 2007, J. Mach. Learn. Res..

[52]  Phong Q. Nguyen,et al.  Learning a Parallelepiped: Cryptanalysis of GGH and NTRU Signatures , 2009, Journal of Cryptology.

[53]  Santosh S. Vempala,et al.  The Spectral Method for General Mixture Models , 2008, SIAM J. Comput..

[54]  Santosh S. Vempala,et al.  Isotropic PCA and Affine-Invariant Clustering , 2008, 2008 49th Annual IEEE Symposium on Foundations of Computer Science.

[55]  Gene H. Golub,et al.  Symmetric Tensors and Symmetric Tensor Rank , 2008, SIAM J. Matrix Anal. Appl..

[56]  Tim Austin On exchangeable random variables and the statistics of large graphs and hypergraphs , 2008, 0801.1698.

[57]  Satish Rao,et al.  Learning Mixtures of Product Distributions Using Correlations and Independence , 2008, COLT.

[58]  Sham M. Kakade,et al.  A spectral algorithm for learning Hidden Markov Models , 2008, J. Comput. Syst. Sci..

[59]  Shang-Hua Teng,et al.  Smoothed analysis: an attempt to explain the behavior of algorithms in practice , 2009, CACM.

[60]  Tamara G. Kolda,et al.  Tensor Decompositions and Applications , 2009, SIAM Rev..

[61]  Pierre Comon,et al.  Subtracting a best rank-1 approximation may increase tensor rank , 2009, 2009 17th European Signal Processing Conference.

[62]  Alper T. Erdogan,et al.  On the Convergence of ICA Algorithms With Symmetric Orthogonalization , 2008, IEEE Transactions on Signal Processing.

[63]  C. Matias,et al.  Identifiability of parameters in latent structure models with many observed variables , 2008, 0809.5032.

[64]  Byron Boots,et al.  Closing the learning-planning loop with predictive state representations , 2009, Int. J. Robotics Res..

[65]  Byron Boots,et al.  Reduced-Rank Hidden Markov Models , 2009, AISTATS.

[66]  Pierre Comon,et al.  Handbook of Blind Source Separation: Independent Component Analysis and Applications , 2010 .

[67]  Adam Tauman Kalai,et al.  Efficiently learning mixtures of two Gaussians , 2010, STOC '10.

[68]  Ankur Moitra,et al.  Settling the Polynomial Learnability of Mixtures of Gaussians , 2010, 2010 IEEE 51st Annual Symposium on Foundations of Computer Science.

[69]  Mikhail Belkin,et al.  Polynomial Learning of Distribution Families , 2010, 2010 IEEE 51st Annual Symposium on Foundations of Computer Science.

[70]  B. Sturmfels,et al.  Binary Cumulant Varieties , 2011, 1103.0153.

[71]  Le Song,et al.  A Spectral Algorithm for Latent Tree Graphical Models , 2011, ICML.

[72]  Robert H. Halstead,et al.  Matrix Computations , 2011, Encyclopedia of Parallel Computing.

[73]  Nathan Halko,et al.  Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions , 2009, SIAM Rev..

[74]  Raphaël Bailly Quadratic Weighted Automata: Spectral Algorithm and Likelihood Maximization , 2011, ACML 2011.

[75]  Tamara G. Kolda,et al.  Shifted Power Method for Computing Tensor Eigenpairs , 2010, SIAM J. Matrix Anal. Appl..

[76]  Byron Boots,et al.  An Online Spectral Learning Algorithm for Partially Observable Nonlinear Dynamical Systems , 2011, AAAI.

[77]  Ariadna Quattoni,et al.  Spectral Learning for Non-Deterministic Dependency Parsing , 2012, EACL.

[78]  Mehryar Mohri,et al.  Spectral Learning of General Weighted Automata via Constrained Matrix Completion , 2012, NIPS.

[79]  Karl Stratos,et al.  Spectral Learning of Latent-Variable PCFGs , 2012, ACL.

[80]  Michael Collins,et al.  Spectral Dependency Parsing with Latent Variables , 2012, EMNLP-CoNLL.

[81]  Seungjin Choi,et al.  Independent Component Analysis , 2009, Handbook of Natural Computing.

[82]  Sanjeev Arora,et al.  Learning Topic Models -- Going beyond SVD , 2012, 2012 IEEE 53rd Annual Symposium on Foundations of Computer Science.

[83]  Sham M. Kakade,et al.  Identifiability and Unmixing of Latent Parse Trees , 2012, NIPS.

[84]  Ariadna Quattoni,et al.  Local Loss Optimization in Operator Models: A New Insight into Spectral Learning , 2012, ICML.

[85]  Dean P. Foster,et al.  Spectral dimensionality reduction for HMMs , 2012, ArXiv.

[86]  Anima Anandkumar,et al.  A Method of Moments for Mixture Models and Hidden Markov Models , 2012, COLT.

[87]  Anima Anandkumar,et al.  Learning Mixtures of Tree Graphical Models , 2012, NIPS.

[88]  B. Sturmfels,et al.  The number of eigenvalues of a tensor , 2010, 1004.4953.

[89]  Sham M. Kakade,et al.  Learning mixtures of spherical gaussians: moment methods and spectral decompositions , 2012, ITCS '13.

[90]  Ryan P. Adams,et al.  Contrastive Learning Using Spectral Methods , 2013, NIPS.

[91]  Dean P. Foster,et al.  Using Regression for Spectral Estimation of HMMs , 2013, SLSP.

[92]  Christopher J. Hillar,et al.  Most Tensor Problems Are NP-Hard , 2009, JACM.

[93]  Aditya Bhaskara,et al.  Smoothed analysis of tensor decompositions , 2013, STOC.

[94]  Mikhail Belkin,et al.  The More, the Merrier: the Blessing of Dimensionality for Learning Large Gaussian Mixtures , 2013, COLT.

[95]  Anima Anandkumar,et al.  A Spectral Algorithm for Latent Dirichlet Allocation , 2012, Algorithmica.

[96]  Sanjeev Arora,et al.  Provable ICA with Unknown Gaussian Noise, and Implications for Gaussian Mixtures and Autoencoders , 2012, Algorithmica.