Hierarchical Re-estimation of Topic Models for Measuring Topical Diversity

A high degree of topical diversity is often considered an important characteristic of interesting text documents. A recent proposal for measuring topical diversity identifies three elements for assessing diversity: words, topics, and documents as collections of words. Topic models play a central role in this approach. Using standard topic models to measure the diversity of documents is suboptimal due to generality and impurity. General topics only include common information from a background corpus and are assigned to most documents in the collection. Impure topics contain words that are not related to the topic; impurity lowers the interpretability of topic models, and impure topics are likely to be assigned to documents erroneously. We propose a hierarchical re-estimation approach for topic models that combats generality and impurity by operating at three levels: words, topics, and documents. Our re-estimation approach to measuring documents' topical diversity outperforms the state of the art on the PubMed dataset, which is commonly used in diversity experiments.
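
To make the measurement concrete: a standard way to score the topical diversity of a single document is Rao's diversity coefficient over its topic distribution, where a document is diverse when it spreads probability over topics that are far apart. The following is a minimal Python sketch under that assumption; the function name and the choice of cosine distance between topic-word vectors are illustrative, not necessarily the paper's exact configuration.

    import numpy as np

    def rao_diversity(theta, topic_word):
        """Rao's coefficient for one document: sum_ij theta_i * theta_j * d(i, j),
        where theta is the document-topic distribution (length K) and d(i, j)
        is a distance between topics, here the cosine distance between rows
        of the K-by-V topic-word matrix."""
        # Pairwise cosine distances between topic-word distributions.
        unit = topic_word / np.linalg.norm(topic_word, axis=1, keepdims=True)
        dist = 1.0 - unit @ unit.T
        # Expected distance between two topics drawn independently from theta.
        return float(theta @ dist @ theta)

For instance, a document with theta = [0.5, 0.5] over two orthogonal topics scores 0.5, while a document concentrated entirely on one topic scores 0. General and impure topics distort exactly the two ingredients of this score: the topic distributions theta and the pairwise topic distances.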
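The re-estimation itself builds on parsimonious language models (Hiemstra et al., SIGIR 2004): an EM loop re-weights a probability distribution against a general background model, so that mass explained by the background is pushed toward zero and pruned. Below is a minimal word-level sketch, assuming term frequencies for one document and a corpus-level background distribution; the mixing weight, iteration count, and pruning threshold are illustrative defaults, not the authors' reported settings.

    def parsimonious_reestimate(doc_tf, background, lam=0.1, iters=50, eps=1e-4):
        """EM re-estimation of P(w|d) against a background model P(w|C).
        lam weighs the document model against the background; words whose
        re-estimated probability falls below eps are pruned as general
        (background) language."""
        total = sum(doc_tf.values())
        p_doc = {w: tf / total for w, tf in doc_tf.items()}  # ML initialization
        for _ in range(iters):
            # E-step: expected number of occurrences of w generated by the
            # document model rather than the background model.
            expect = {}
            for w, tf in doc_tf.items():
                if w not in p_doc:
                    continue  # already pruned in an earlier iteration
                d = lam * p_doc[w]
                expect[w] = tf * d / (d + (1.0 - lam) * background[w])
            # M-step: renormalize and prune words with near-zero mass.
            norm = sum(expect.values())
            p_doc = {w: e / norm for w, e in expect.items() if e / norm >= eps}
        total_p = sum(p_doc.values())
        return {w: p / total_p for w, p in p_doc.items()}  # final renormalization

In the hierarchical setup described above, the same re-estimation idea is applied at all three levels: document-word distributions are re-estimated against the corpus model, topic-word distributions are purified of general words, and document-topic distributions are re-estimated to demote general topics, before diversity is computed.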
