A Word Position-Related LDA Model

Latent Dirichlet Allocation (LDA), proposed by Blei et al., is a generative probabilistic model of a corpus in which documents are represented as random mixtures over latent topics and each topic is characterized by a distribution over words. LDA does not, however, model the positions at which words occur in each document. This paper proposes a Word Position-Related LDA Model that takes the word positions of every document in the corpus into account, so that each word is characterized by a distribution over word positions. The precision of topic-word interpretability is improved by integrating the word-position distribution with the appropriate word degree, since a word's degree differs across word positions. Finally, a new method, the size-aware word intrusion method, is proposed to strengthen the interpretation of topic words. Experimental results on the NIPS corpus show that the Word Position-Related LDA Model improves the precision of topic-word interpretability by about 9.67% on average. A comparison across the different experimental data also shows that the size-aware word intrusion method interprets the semantic information of topic words more comprehensively and more effectively.
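
To make the idea of "a distribution over word positions" concrete, the following is a minimal sketch, not the model or inference procedure of the paper. It assumes that positions are normalized by document length and grouped into a fixed number of equal-width buckets; the function name `word_position_distributions` and the parameter `num_bins` are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def word_position_distributions(docs, num_bins=4):
    """Estimate an empirical distribution over relative positions for each word.

    docs: list of tokenized documents (each a list of strings).
    num_bins: assumed number of equal-width position buckets per document
              (an illustrative choice, not taken from the paper).
    """
    counts = defaultdict(Counter)
    for tokens in docs:
        n = len(tokens)
        for i, word in enumerate(tokens):
            # Map the absolute index i to a bucket in [0, num_bins - 1].
            bucket = min(i * num_bins // n, num_bins - 1)
            counts[word][bucket] += 1

    dists = {}
    for word, bucket_counts in counts.items():
        total = sum(bucket_counts.values())
        # Normalize bucket counts so each word's values sum to 1.
        dists[word] = [bucket_counts[b] / total for b in range(num_bins)]
    return dists


# Toy corpus, invented for illustration only.
docs = [
    ["topic", "models", "describe", "documents"],
    ["documents", "contain", "latent", "topic"],
]
dists = word_position_distributions(docs, num_bins=2)
print(dists["models"])  # "models" occurs only in the first half of a document
print(dists["topic"])   # "topic" occurs once in each half
```

Counting relative rather than absolute positions is one simple way to compare documents of different lengths. The paper's model may define positions and the associated word degree differently.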
