Towards big topic modeling

To solve the big topic modeling problem, we need to reduce both the time and space complexities of batch latent Dirichlet allocation (LDA) algorithms. Although parallel LDA algorithms on the multi-processor architecture have low time and space complexities, their communication costs among processors often scale linearly with the vocabulary size and the number of topics, leading to a serious scalability problem. To reduce the communication complexity among processors and achieve better scalability, we propose a novel communication-efficient parallel topic modeling architecture based on the power law, which consumes orders of magnitude less communication time when the number of topics is large. We combine the proposed communication-efficient parallel architecture with the online belief propagation (OBP) algorithm, and refer to the resulting algorithm as POBP, for big topic modeling tasks. Extensive empirical results confirm that, compared with recent state-of-the-art parallel LDA algorithms on the multi-processor architecture, POBP offers the following advantages for the big topic modeling problem: 1) high accuracy, 2) communication efficiency, 3) fast speed, and 4) constant memory usage.
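
The scaling argument can be illustrated with a minimal sketch. It is not the POBP architecture itself, whose mechanism the abstract does not specify. It assumes, purely for illustration, that word frequencies follow a Zipf-type power law and that a parallel scheme synchronizes only the high-frequency "head" of the vocabulary rather than the full word-by-topic matrix. The function names, the exponent, the coverage level, and the vocabulary and topic sizes are all hypothetical.

```python
from itertools import accumulate


def zipf_weights(vocab_size, exponent=1.0):
    """Unnormalized Zipf weights: the word at rank r gets weight 1 / r**exponent.
    (Assumed frequency model, not taken from the paper.)"""
    return [1.0 / rank ** exponent for rank in range(1, vocab_size + 1)]


def head_size(weights, coverage):
    """Smallest number of leading entries whose cumulative weight reaches
    `coverage` (a fraction in (0, 1]) of the total weight.
    Assumes `weights` is sorted in descending order."""
    target = coverage * sum(weights)
    for count, running in enumerate(accumulate(weights), start=1):
        if running >= target:
            return count
    return len(weights)


def sync_cost(num_words, num_topics):
    """Matrix entries exchanged per synchronization: linear in both the
    number of words synchronized and the number of topics."""
    return num_words * num_topics


if __name__ == "__main__":
    vocab_size, num_topics = 10_000, 1_000  # hypothetical sizes
    weights = zipf_weights(vocab_size)
    head = head_size(weights, coverage=0.8)  # words covering 80% of tokens

    full = sync_cost(vocab_size, num_topics)
    partial = sync_cost(head, num_topics)
    print(f"head words: {head} of {vocab_size}")
    print(f"entries per sync: full={full}, head-only={partial}")
```

Under these assumptions, a small fraction of the vocabulary accounts for most word occurrences, so the head-only cost is far below the full cost that grows with the vocabulary size times the number of topics. How much communication a real system saves depends on the selection rule it actually uses.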
