WarpLDA: a Simple and Efficient O(1) Algorithm for Latent Dirichlet Allocation

Developing efficient and scalable algorithms for Latent Dirichlet Allocation (LDA) is of wide interest for many applications. Previous work has developed an O(1) Metropolis-Hastings sampling method for each token. However, its performance is far from optimal due to random accesses to the parameter matrices and frequent cache misses. In this paper, we propose WarpLDA, a novel O(1) sampling algorithm for LDA. WarpLDA is a Metropolis-Hastings based algorithm designed to optimize the cache hit rate. Its advantages include: 1) Efficiency and scalability: WarpLDA has good locality and a carefully designed partitioning method, and scales to hundreds of machines; 2) Simplicity: WarpLDA has no complicated modules such as alias tables, hybrid data structures, or parameter servers, making it easy to understand and implement; 3) Robustness: WarpLDA is consistently faster than other algorithms under various settings, from small-scale to massive-scale datasets and models. WarpLDA is 5-15x faster than state-of-the-art LDA samplers, reducing both time and cost. With WarpLDA, users can learn up to one million topics from hundreds of millions of documents in a few hours, at a speed of 2 billion tokens per second, or learn topics from small-scale datasets in seconds.
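To illustrate the idea behind O(1)-per-token Metropolis-Hastings sampling, the sketch below draws from an unnormalized discrete distribution using a uniform proposal, so each step costs O(1) regardless of the number of topics. This is a minimal, hypothetical illustration only: WarpLDA's actual proposal distributions and data layout are more elaborate, and the function names here (`mh_discrete_sample`) are invented for the example.

```python
import random

def mh_discrete_sample(weights, current, num_steps=4, rng=random):
    """Metropolis-Hastings sampling from an unnormalized discrete
    distribution `weights`. Each step is O(1): propose a state
    uniformly at random and accept it with probability
    min(1, weights[proposal] / weights[current]).

    The uniform proposal is symmetric, so the standard Metropolis
    acceptance ratio leaves the target distribution invariant.
    """
    k = len(weights)
    for _ in range(num_steps):
        proposal = rng.randrange(k)  # O(1) proposal, no alias table needed
        accept_prob = min(1.0, weights[proposal] / weights[current])
        if rng.random() < accept_prob:
            current = proposal
    return current
```

After a few steps the chain's state is approximately distributed in proportion to `weights`; in an LDA sampler, `weights` would be the (unnormalized) conditional topic distribution for a token, and a small fixed `num_steps` keeps the per-token cost constant.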
