Stochastic variational inference

We develop stochastic variational inference, a scalable algorithm for approximating posterior distributions. We develop this technique for a large class of probabilistic models and we demonstrate it with two probabilistic topic models, latent Dirichlet allocation and the hierarchical Dirichlet process topic model. Using stochastic variational inference, we analyze several large collections of documents: 300K articles from Nature, 1.8M articles from The New York Times, and 3.8M articles from Wikipedia. Stochastic inference can easily handle data sets of this size and outperforms traditional variational inference, which can only handle a smaller subset. (We also show that the Bayesian nonparametric topic model outperforms its parametric counterpart.) Stochastic variational inference lets us apply complex Bayesian models to massive data sets.
