Visualizing Topics with Multi-Word Expressions

We describe a new method for visualizing topics, the distributions over terms that are automatically extracted from large text corpora using latent variable models. Our method finds significant $n$-grams related to a topic, which are then used to help understand and interpret the underlying distribution. Compared with the usual visualization, which simply lists the most probable topical terms, the multi-word expressions provide a better intuitive impression of what a topic is "about." Our approach is based on a language model of arbitrary-length expressions, for which we develop a new methodology based on nested permutation tests to find significant phrases. We show that this method outperforms the more standard use of $\chi^2$ and likelihood ratio tests. We illustrate the topic presentations on corpora of scientific abstracts and news articles.
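To make the core idea concrete, the following is a minimal sketch of a permutation test for whether a candidate bigram co-occurs more often than chance would predict. This is a simplified, single-level illustration, not the paper's nested permutation procedure over a language model of arbitrary-length expressions; the function names and the whole-sequence shuffling scheme are illustrative assumptions.

```python
import random

def bigram_count(tokens, bigram):
    """Count adjacent occurrences of the bigram in the token sequence."""
    w1, w2 = bigram
    return sum(1 for a, b in zip(tokens, tokens[1:]) if a == w1 and b == w2)

def permutation_test(tokens, bigram, n_perm=1000, seed=0):
    """Approximate p-value that the bigram occurs as an adjacent pair
    more often than expected under a random ordering of the same words,
    estimated by repeatedly shuffling the token sequence."""
    rng = random.Random(seed)
    observed = bigram_count(tokens, bigram)
    shuffled = list(tokens)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if bigram_count(shuffled, bigram) >= observed:
            exceed += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return (exceed + 1) / (n_perm + 1)
```

Unlike $\chi^2$ and likelihood ratio tests, whose asymptotic null distributions can be unreliable for the rare events typical of $n$-gram counts, a permutation test makes no distributional assumption: the null is generated directly by resampling.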
