Improving and Evaluating Topic Models and Other Models of Text

ABSTRACT An ongoing challenge in the analysis of document collections is how to summarize content in terms of a set of inferred themes that can be interpreted substantively as topics. The current practice of parameterizing themes by their most frequent words limits interpretability because it ignores the differential use of words across topics. Here we show that words that are both frequent in and exclusive to a theme are more effective at characterizing topical content, and we propose a regularization scheme that leads to better estimates of these quantities. We first consider a supervised setting in which professional editors have annotated documents with topic categories organized into a tree, with leaf nodes corresponding to more specific topics; each document is annotated with multiple categories at different levels of the tree. We introduce a hierarchical Poisson convolution model to analyze these annotated documents, together with a parallelized Hamiltonian Monte Carlo sampler that allows inference to scale to millions of documents. The model leverages the category structure defined by the editors to infer a clear semantic description of each topic in terms of words that are both frequent and exclusive. In this supervised setting, we validate the efficacy of word frequency and exclusivity for characterizing topical content on two very large document collections, from Reuters and the New York Times. We then consider an unsupervised setting, using a simplified version of the model that shares the same regularization scheme. A large randomized experiment on Amazon Mechanical Turk demonstrates that topic summaries based on frequency and exclusivity, estimated with the proposed regularization scheme, are more interpretable than the established frequency-based summaries, and that the proposed model yields more efficient estimates of exclusivity than established alternatives.
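To make the frequency-exclusivity idea concrete, the sketch below ranks each topic's words by a harmonic mean of their within-topic frequency rank and their across-topic exclusivity rank, so a word scores highly only if it does well on both scales. This is a minimal illustration of the summary described in the abstract, not the paper's fitted model: the topic-word matrix `phi`, the weight `w`, and the empirical-CDF ranking are illustrative assumptions.

```python
import numpy as np

def frex_summary(phi, w=0.5, top_n=5):
    """Rank words per topic by a harmonic mean of frequency and exclusivity.

    phi   : (K, V) array of nonnegative topic-word rates (rows sum to 1).
    w     : weight on exclusivity relative to frequency.
    top_n : number of summary words to return per topic.
    """
    K, V = phi.shape
    # Exclusivity: the share of a word's total use attributable to each topic.
    excl = phi / phi.sum(axis=0, keepdims=True)
    # Empirical-CDF ranks (in [0, 1]) within each topic, for both quantities.
    freq_rank = np.argsort(np.argsort(phi, axis=1), axis=1) / (V - 1)
    excl_rank = np.argsort(np.argsort(excl, axis=1), axis=1) / (V - 1)
    # Harmonic mean rewards words ranked highly on *both* scales; a word that
    # is frequent everywhere (low exclusivity) is pulled down sharply.
    eps = 1e-12
    frex = 1.0 / (w / (excl_rank + eps) + (1 - w) / (freq_rank + eps))
    return [np.argsort(-frex[k])[:top_n] for k in range(K)]
```

Contrast this with a frequency-only summary, which simply takes `np.argsort(-phi[k])[:top_n]` for each topic: common stopword-like terms that appear across many topics can dominate those lists, whereas the harmonic-mean combination suppresses them.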
