Extracting Representative Words of a Topic Determined by Latent Dirichlet Allocation

Determining the topic of a document is necessary for understanding its content efficiently. Latent Dirichlet Allocation (LDA) is a method for topic analysis. In LDA, a topic is treated as an unobservable (latent) variable that defines a probability distribution over words, and a topic is interpreted through the list of words that appear with high probability in it. This approach works well when the topics are determined from a large collection of documents with varied contents. However, when the topics are determined from a set of article abstracts retrieved by a keyword search, whose contents are limited and similar to one another, they are difficult to interpret with conventional LDA alone. We propose a method to estimate representative words of each topic from an LDA result. Experimental results show that our method provides better information for interpreting a topic than LDA does.

Keywords: LDA; topic analysis; Gibbs sampling.
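As context for the interpretation step described above, the following is a minimal sketch (not the authors' proposed method) of fitting LDA and listing each topic's highest-probability words, using scikit-learn; the corpus, number of topics, and other parameters are illustrative placeholders.

```python
# Sketch: conventional topic interpretation via top words of the
# topic-word distribution learned by LDA (assumes scikit-learn).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Placeholder abstracts; in the paper's setting these would be
# article abstracts returned by a keyword search.
abstracts = [
    "latent dirichlet allocation infers topics from word counts",
    "gibbs sampling estimates the posterior over topic assignments",
    "a keyword search returns abstracts with limited, similar content",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(abstracts)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

vocab = vectorizer.get_feature_names_out()
for k, word_weights in enumerate(lda.components_):
    # components_[k] is proportional to topic k's word distribution;
    # its top-weighted words are the usual basis for interpreting the topic.
    top = word_weights.argsort()[::-1][:5]
    print(f"topic {k}: " + ", ".join(vocab[i] for i in top))
```

When the documents are varied, these top-word lists are usually distinctive enough to name each topic; the paper's motivation is that for narrow, similar abstracts the lists overlap heavily, which is why a separate estimate of representative words is proposed.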
