Bayesian Supervised Topic Modeling with Covariates

Topic modeling is a latent variable approach to quantitatively model text. This approach assumes that text can be summarized as a mixture of latent categories (i.e., topics; Blei, Ng, & Jordan, 2003). By representing patterns of word co-occurrences as topics, underlying features of the text can be extracted and studied on their own or used as variables in subsequent analyses. Applications of topic modeling in psychological research typically estimate topics first and then use them as predictors in a separate regression model. However, this two-stage procedure ignores the uncertainty in the topic estimates, which can lead to incorrect parameter estimates and incorrect inference. Blei and McAuliffe (2008) proposed a one-stage model, supervised latent Dirichlet allocation (SLDA), in which the topics predict an observed outcome. However, SLDA does not allow additional predictors of the outcome to be included, which limits its utility for psychological research. We therefore extended SLDA to jointly estimate a latent variable model of text and a regression model that predicts an outcome from both topics and additional predictors. Our proposed model (SLDAX; Figure 1) has two components: (1) a topic model and (2) a generalized linear regression model predicting the outcome from the topics and additional predictors. To estimate the SLDAX model, we analytically derived a Gibbs sampling algorithm, which is implemented in the R package psychtm. The model and Gibbs sampler were evaluated in a simulation study based on an SLDAX model with a continuous predictor, a dichotomous predictor, and K topics. We manipulated the number of topics, number of documents, vocabulary size, document length, effect size of the topics as predictors, and the prior covariance matrix of the regression coefficients. Results showed that the proposed Bayesian method recovered model parameters well and provided appropriate Type I error rates and standard errors. We provide sample size recommendations regarding the number of documents and document length.
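
The two-component structure described above can be illustrated with a minimal simulation sketch in Python. This is not the psychtm implementation or its API. It only draws data from a model of the same general shape, assuming a Gaussian outcome, symmetric Dirichlet priors, and a regression on each document's empirical topic proportions (as in SLDA). The function name simulate_sldax, the hyperparameters alpha, gamma, and sigma, and the coefficient values are all hypothetical choices made for illustration.

```python
import numpy as np


def simulate_sldax(n_docs=50, doc_len=40, vocab_size=100, n_topics=3,
                   alpha=1.0, gamma=0.5, sigma=1.0, seed=0):
    """Draw documents and outcomes from an SLDAX-style generative model.

    All default values are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)

    # Component 1: topic model (latent Dirichlet allocation).
    # Each topic is a distribution over the vocabulary (K x V).
    beta = rng.dirichlet(np.full(vocab_size, gamma), size=n_topics)
    # Each document is a mixture over topics (D x K).
    theta = rng.dirichlet(np.full(n_topics, alpha), size=n_docs)

    docs = np.empty((n_docs, doc_len), dtype=int)
    zbar = np.empty((n_docs, n_topics))
    for d in range(n_docs):
        # Topic assignment for every word position in document d.
        z = rng.choice(n_topics, size=doc_len, p=theta[d])
        # Each word is drawn from the distribution of its assigned topic.
        for n in range(doc_len):
            docs[d, n] = rng.choice(vocab_size, p=beta[z[n]])
        # Empirical topic proportions, used as predictors of the outcome.
        zbar[d] = np.bincount(z, minlength=n_topics) / doc_len

    # Component 2: regression on topics plus additional predictors.
    x_cont = rng.normal(size=n_docs)          # continuous predictor
    x_bin = rng.integers(0, 2, size=n_docs)   # dichotomous predictor
    X = np.column_stack([x_cont, x_bin])

    eta_x = np.array([0.5, -0.5])             # hypothetical covariate effects
    eta_z = np.linspace(-1.0, 1.0, n_topics)  # hypothetical topic effects

    # No separate intercept: the proportions in zbar sum to 1 in every
    # document, so the topic coefficients already absorb the intercept.
    y = X @ eta_x + zbar @ eta_z + rng.normal(scale=sigma, size=n_docs)
    return docs, zbar, X, y


docs, zbar, X, y = simulate_sldax()
print(docs.shape, zbar.shape, X.shape, y.shape)
```

In a two-stage analysis, zbar would be estimated first and then treated as known when regressing y on it. A joint model instead treats the topic assignments as unknown when estimating the regression coefficients, so their uncertainty carries through to the coefficient estimates.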

[1] David M. Blei, et al. Supervised Topic Models, 2007, NIPS.

[2] Michael I. Jordan, et al. Latent Dirichlet Allocation, 2001, J. Mach. Learn. Res.