Extreme Stochastic Variational Inference: Distributed Inference for Large Scale Mixture Models

Mixtures of exponential family models are among the most fundamental and widely used statistical models. Stochastic variational inference (SVI), the state-of-the-art algorithm for parameter estimation in such models, is inherently serial. Moreover, it requires the parameters to fit in the memory of a single processor, which poses serious limitations on scalability when the number of parameters runs into the billions. In this paper, we present extreme stochastic variational inference (ESVI), a distributed, asynchronous, and lock-free algorithm for variational inference in mixture models on massive real-world datasets. ESVI overcomes the limitations of SVI by requiring each processor to access only a subset of the data and a subset of the parameters, thereby providing data and model parallelism simultaneously. Our empirical study demonstrates that ESVI not only outperforms VI and SVI in wall-clock time, but also achieves a better-quality solution. To further speed up computation and save memory when fitting a large number of topics, we propose a variant, ESVI-TOPK, which maintains only the top-k most important topics. Empirically, we find that keeping the top 25% of topics suffices to achieve the same accuracy as storing all topics.
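To make the ESVI-TOPK idea concrete, the short Python/NumPy sketch below shows one plausible way to truncate a data point's topic responsibilities to the k most probable topics and renormalize; the function name topk_responsibilities and the toy sizes are our own illustration, not the paper's implementation, and the paper's actual update rules may differ.

    import numpy as np

    def topk_responsibilities(log_weights, k):
        # Keep only the k most probable topics for one data point and
        # renormalize: a sketch of the top-k pruning idea behind ESVI-TOPK.
        idx = np.argpartition(log_weights, -k)[-k:]    # indices of the k largest entries
        w = log_weights[idx] - log_weights[idx].max()  # subtract max for numerical stability
        p = np.exp(w)
        return idx, p / p.sum()                        # renormalized probabilities over kept topics

    # Toy example: 20 topics, keep the top 25% (k = 5).
    rng = np.random.default_rng(0)
    idx, probs = topk_responsibilities(rng.normal(size=20), k=5)
    print(idx, probs)

In a full inference loop, only these k (topic index, probability) pairs would need to be stored and fed into the subsequent parameter updates, which is where the memory and compute savings of a top-k variant would come from.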
