A Note on Topical N-grams

Abstract : Most of the popular topic models (such as Latent Dirichlet Allocation) have an underlying assumption: bag of words. However, text is indeed a sequence of discrete word tokens, and without considering the order of words (in another word, the nearby context where a word is located), the accurate meaning of language cannot be exactly captured by word co-occurrences only. In this sense, collocations of words (phrases) have to be considered. However, like individual words, phrases sometimes show polysemy as well depending on the context. More noticeably, a composition of two (or more) words is a phrase in some contexts, but not in other contexts. In this paper, the authors propose a new probabilistic generative model that automatically determines unigram words and phrases based on context and simultaneously associates them with a mixture of topics. They present very interesting results on large text corpora.

[1]  Thomas L. Griffiths,et al.  Integrating Topics and Syntax , 2004, NIPS.

[2]  Slava M. Katz,et al.  Technical terminology: some linguistic properties and an algorithm for identification in text , 1995, Natural Language Engineering.

[3]  Hanna M. Wallach,et al.  Topic modeling: beyond bag-of-words , 2006, ICML.

[4]  Mary Hart,et al.  Automatic indexing using selective NLP and first-order thesauri , 1991, RIAO.

[5]  Kenneth Ward Church,et al.  Using Statistics in Lexical Analysis , 2003, Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon.

[6]  Julia E. Hodges,et al.  An automated system that assists in the generation of document indexes , 1996, Nat. Lang. Eng..

[7]  Tomek Strzalkowski Natural Language Information Retrieval , 1995, Inf. Process. Manag..

[8]  Joel L. Fagan The effectiveness of a nonsyntatic approach to automatic phrase indexing for document retrieval , 1989 .

[9]  Hinrich Schütze,et al.  Book Reviews: Foundations of Statistical Natural Language Processing , 1999, CL.

[10]  David J. C. MacKay,et al.  A hierarchical Dirichlet language model , 1995, Natural Language Engineering.

[11]  Kenneth Ward Church,et al.  Word Association Norms, Mutual Information, and Lexicography , 1989, ACL.

[12]  SmadjaFrank Retrieving collocations from text , 1993 .

[13]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[14]  Joel L. Fagan,et al.  The effectiveness of a nonsyntactic approach to automatic phrase indexing for document retrieval , 1989, JASIS.

[15]  Ted Dunning,et al.  Accurate Methods for the Statistics of Surprise and Coincidence , 1993, CL.

[16]  Frank Smadja,et al.  Retrieving Collocations from Text: Xtract , 1993, CL.

[17]  W. Bruce Croft,et al.  LDA-based document models for ad-hoc retrieval , 2006, SIGIR.

[18]  Claire Cardie,et al.  An Analysis of Statistical and Syntactic Phrases , 1997, RIAO.