Hierarchical Re-estimation of Topic Models for Measuring Topical Diversity

A high degree of topical diversity is often considered an important characteristic of interesting text documents. A recent proposal for measuring topical diversity identifies three elements for assessing diversity: words, topics, and documents as collections of words. Topic models play a central role in this approach. Using standard topic models to measure the diversity of documents is suboptimal due to generality and impurity. General topics only include common information from a background corpus and are assigned to most documents in the collection. Impure topics contain words that are not related to the topic; impurity lowers the interpretability of topic models, and impure topics are likely to be assigned to documents erroneously. We propose a hierarchical re-estimation approach for topic models that combats generality and impurity by operating at three levels: words, topics, and documents. Our re-estimation approach to measuring documents' topical diversity outperforms the state of the art on the PubMed dataset, which is commonly used in diversity experiments.
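
To make the measurement concrete: a standard way to score the topical diversity of a single document is Rao's diversity coefficient over its topic distribution, where a document is diverse when it spreads probability over topics that are far apart. The following is a minimal Python sketch under that assumption; the function name and the choice of cosine distance between topic-word vectors are illustrative, not necessarily the paper's exact configuration.

    import numpy as np

    def rao_diversity(theta, topic_word):
        """Rao's coefficient for one document: sum_ij theta_i * theta_j * d(i, j),
        where theta is the document-topic distribution (length K) and d(i, j)
        is a distance between topics, here the cosine distance between rows
        of the K-by-V topic-word matrix."""
        # Pairwise cosine distances between topic-word distributions.
        unit = topic_word / np.linalg.norm(topic_word, axis=1, keepdims=True)
        dist = 1.0 - unit @ unit.T
        # Expected distance between two topics drawn independently from theta.
        return float(theta @ dist @ theta)

For instance, a document with theta = [0.5, 0.5] over two orthogonal topics scores 0.5, while a document concentrated entirely on one topic scores 0. General and impure topics distort exactly the two ingredients of this score: the topic distributions theta and the pairwise topic distances.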
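The re-estimation itself builds on parsimonious language models (Hiemstra et al., SIGIR 2004): an EM loop re-weights a probability distribution against a general background model, so that mass explained by the background is pushed toward zero and pruned. Below is a minimal word-level sketch, assuming term frequencies for one document and a corpus-level background distribution; the mixing weight, iteration count, and pruning threshold are illustrative defaults, not the authors' reported settings.

    def parsimonious_reestimate(doc_tf, background, lam=0.1, iters=50, eps=1e-4):
        """EM re-estimation of P(w|d) against a background model P(w|C).
        lam weighs the document model against the background; words whose
        re-estimated probability falls below eps are pruned as general
        (background) language."""
        total = sum(doc_tf.values())
        p_doc = {w: tf / total for w, tf in doc_tf.items()}  # ML initialization
        for _ in range(iters):
            # E-step: expected number of occurrences of w generated by the
            # document model rather than the background model.
            expect = {}
            for w, tf in doc_tf.items():
                if w not in p_doc:
                    continue  # already pruned in an earlier iteration
                d = lam * p_doc[w]
                expect[w] = tf * d / (d + (1.0 - lam) * background[w])
            # M-step: renormalize and prune words with near-zero mass.
            norm = sum(expect.values())
            p_doc = {w: e / norm for w, e in expect.items() if e / norm >= eps}
        total_p = sum(p_doc.values())
        return {w: p / total_p for w, p in p_doc.items()}  # final renormalization

In the hierarchical setup described above, the same re-estimation idea is applied at all three levels: document-word distributions are re-estimated against the corpus model, topic-word distributions are purified of general words, and document-topic distributions are re-estimated to demote general topics, before diversity is computed.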
