Scalable Parallel Topic Models

The topic model is a popular probabilistic model for text and document modeling. It can be used for topic indexing, document classification, corpus summarization, and information retrieval. In the past, topic models have been applied to corpora containing thousands to hundreds of thousands of documents. There is now a growing need to model collections with millions to billions of documents. We present a parallel algorithm for the topic model that achieves linear speedup and high parallel efficiency on shared-memory symmetric multiprocessors (SMPs). Using this parallel algorithm, topic model computations on an 8-processor system took one-seventh of the time required for the same computation on a single processor.
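The reported timing can be restated in the usual terms of speedup and parallel efficiency. The short sketch below does that arithmetic under the standard definitions (speedup = serial time / parallel time, efficiency = speedup / processor count). The function names and the normalized timings are illustrative assumptions, not values or code from the paper; only the 1/7 ratio and the processor count of 8 come from the abstract.

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """Ratio of single-processor time to parallel time."""
    return t_serial / t_parallel


def parallel_efficiency(t_serial: float, t_parallel: float, n_processors: int) -> float:
    """Speedup divided by the number of processors (1.0 would be ideal)."""
    return speedup(t_serial, t_parallel) / n_processors


# Hypothetical normalized timings: the abstract gives only the ratio,
# so the single-processor run is set to 7 units and the 8-processor run to 1.
t_one_processor = 7.0
t_eight_processors = 1.0

s = speedup(t_one_processor, t_eight_processors)
e = parallel_efficiency(t_one_processor, t_eight_processors, 8)
print(f"speedup = {s:.1f}x, efficiency = {e:.1%}")
```

Under these definitions, a run that takes one-seventh of the serial time on 8 processors corresponds to a 7x speedup and a parallel efficiency of 87.5%.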
