Theoretical Analysis of Bayesian Matrix Factorization

Recently, variational Bayesian (VB) techniques have been applied to probabilistic matrix factorization and shown to perform very well in experiments. In this paper, we theoretically elucidate properties of the VB matrix factorization (VBMF) method. Through finite-sample analysis of the VBMF estimator, we show that it contains two types of shrinkage factors: the positive-part James-Stein (PJS) shrinkage and the trace-norm shrinkage. Both act on each singular component separately to produce low-rank solutions. The trace-norm shrinkage is induced simply by non-flat prior information, as in the maximum a posteriori (MAP) approach, so no trace-norm shrinkage remains when the priors are non-informative. In contrast, we show the counter-intuitive fact that the PJS shrinkage factor remains active even with flat priors. This is shown to be induced by the non-identifiability of the matrix factorization model, that is, the mapping between the target matrix and the factorized matrices is not one-to-one. We call this model-induced regularization. We further extend our analysis to empirical Bayes scenarios, where the hyperparameters are also learned based on the VB free energy. Throughout the paper, we assume that the observed matrix has no missing entries, and therefore collaborative filtering is out of scope.
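As a rough illustration of the two ideas above, the following NumPy sketch applies a generic trace-norm (soft-thresholding) rule and a generic positive-part James-Stein-type rule to each singular value of an observed matrix, and checks numerically that a factorization is not unique. It is not the VBMF estimator derived in the paper: the function names, the threshold `lam`, and the constant `c` are hypothetical choices made only for illustration.

```python
import numpy as np


def trace_norm_shrink(s, lam):
    """Soft-threshold singular values: subtract lam, clip at zero."""
    return np.maximum(np.asarray(s, dtype=float) - lam, 0.0)


def pjs_shrink(s, c):
    """Positive-part James-Stein-type rule: max(0, 1 - c / s^2) * s."""
    s = np.asarray(s, dtype=float)
    factor = np.zeros_like(s)
    nonzero = s > 0
    factor[nonzero] = np.maximum(0.0, 1.0 - c / s[nonzero] ** 2)
    return factor * s


def shrink_singular_components(V, shrink):
    """Apply a shrinkage rule to each singular component of V separately."""
    U, s, Vt = np.linalg.svd(V, full_matrices=False)
    return (U * shrink(s)) @ Vt


def reparametrize(B, A, T):
    """Return another factor pair (B T, A T^{-T}) with the same product B A^T."""
    return B @ T, A @ np.linalg.inv(T).T


# Both rules zero out small singular components, giving a low-rank estimate.
V = np.diag([3.0, 1.0, 0.5])
V_tn = shrink_singular_components(V, lambda s: trace_norm_shrink(s, lam=1.0))
V_pjs = shrink_singular_components(V, lambda s: pjs_shrink(s, c=1.0))

# Non-identifiability: different factor pairs map to the same target matrix.
rng = np.random.default_rng(0)
B, A = rng.normal(size=(4, 2)), rng.normal(size=(3, 2))
T = np.array([[2.0, 1.0], [0.0, 1.0]])  # any invertible 2 x 2 matrix
B2, A2 = reparametrize(B, A, T)
same_product = np.allclose(B @ A.T, B2 @ A2.T)
```

In this sketch the soft-thresholding rule shrinks every singular value by the same amount, whereas the James-Stein-type rule shrinks large singular values proportionally less; both set sufficiently small components exactly to zero.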
