Partially collapsed Gibbs sampling for latent Dirichlet allocation

Abstract A latent Dirichlet allocation (LDA) model is a machine learning technique for identifying latent topics in text corpora within a Bayesian hierarchical framework. Popular inferential methods for fitting the LDA model are based on variational Bayesian inference, collapsed Gibbs sampling, or a combination of the two. Because these methods assume a unimodal distribution over topics, however, they can suffer from large bias when a text corpus consists of multiple clusters with different topic distributions. This paper proposes an inferential LDA method that efficiently obtains unbiased estimates under flexible modeling of heterogeneous text corpora by combining the method of partial collapse with Dirichlet process mixtures. The method is illustrated using a simulation study and an application to a corpus of 1,300 documents from Neural Information Processing Systems (NIPS) conference articles during the period of 2000–2002 and British Broadcasting Corporation (BBC) news articles during the period of 2004–2005.
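For context, the standard (fully) collapsed Gibbs sampler that the paper builds on integrates out the document-topic and topic-word proportions and resamples each token's topic assignment from its full conditional. The sketch below is a minimal pure-Python implementation of that baseline sampler (in the style of Griffiths and Steyvers), not the paper's partially collapsed method; the function name, hyperparameter defaults, and toy inputs are illustrative assumptions, not taken from the paper.

```python
import random

def collapsed_gibbs_lda(docs, K, V, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Minimal collapsed Gibbs sampler for LDA (baseline, not partially collapsed).

    docs : list of documents, each a list of word ids in [0, V)
    K    : number of topics; V : vocabulary size
    Returns topic assignments z and the doc-topic / topic-word count matrices.
    """
    rng = random.Random(seed)
    ndk = [[0] * K for _ in docs]      # doc-topic counts n_{d,k}
    nkw = [[0] * V for _ in range(K)]  # topic-word counts n_{k,w}
    nk = [0] * K                       # topic totals n_k
    z = []                             # topic assignment per token
    # initialize assignments uniformly at random
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            t = rng.randrange(K)
            zd.append(t)
            ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                # remove the token's current assignment from the counts
                ndk[d][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
                # full conditional p(z = k | z_-di, w) with theta, phi integrated out:
                #   (n_{d,k} + alpha) * (n_{k,w} + beta) / (n_k + V*beta)
                weights = [(ndk[d][k] + alpha) * (nkw[k][w] + beta) / (nk[k] + V * beta)
                           for k in range(K)]
                u = rng.random() * sum(weights)
                for k in range(K):
                    u -= weights[k]
                    if u <= 0:
                        break
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return z, ndk, nkw
```

Because every token is resampled conditional on a single shared set of counts, this sampler implicitly assumes one common distribution over topics for the corpus; the paper's point is that partial collapse combined with Dirichlet process mixtures relaxes exactly this assumption for clustered, heterogeneous corpora.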
