Paper Information - A Note on Topical N-grams

A Note on Topical N-grams

Abstract: Most popular topic models, such as Latent Dirichlet Allocation, rest on the bag-of-words assumption. Text, however, is a sequence of discrete word tokens. Without word order (in other words, the nearby context in which a word appears), word co-occurrence alone cannot capture the meaning of language accurately. Collocations of words (phrases) therefore have to be considered. Like individual words, phrases are sometimes polysemous, depending on the context. More notably, a combination of two or more words may form a phrase in some contexts but not in others. In this paper, the authors propose a new probabilistic generative model that automatically determines unigram words and phrases based on context and simultaneously associates them with a mixture of topics. They present interesting results on large text corpora.
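The bag-of-words limitation described in the abstract can be illustrated with a minimal sketch. This is not the authors' model; it only shows that unigram counts discard word order while adjacent-word pairs retain it. The two example sentences and the helper names `unigram_counts` and `bigram_counts` are assumptions made up for illustration.

```python
from collections import Counter


def unigram_counts(tokens):
    """Bag-of-words view: count each token, ignoring its position."""
    return Counter(tokens)


def bigram_counts(tokens):
    """Order-aware view: count each pair of adjacent tokens."""
    return Counter(zip(tokens, tokens[1:]))


# Two hypothetical sentences built from exactly the same words.
a = "the white house announced a plan".split()
b = "the house announced a white plan".split()

# The bag-of-words view cannot tell the two sentences apart.
same_bag = unigram_counts(a) == unigram_counts(b)

# The bigram view can: only the first sentence contains "white house".
phrase = ("white", "house")
in_a = phrase in bigram_counts(a)
in_b = phrase in bigram_counts(b)

print(same_bag, in_a, in_b)
```

Under these assumptions, the candidate phrase "white house" is visible only when adjacency is kept, which is the kind of context a phrase-aware topic model needs in order to decide whether two words form a collocation.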

Andrew McCallum | Xuerui Wang

[1] Thomas L. Griffiths, et al. Integrating Topics and Syntax, 2004, NIPS.

[2] Slava M. Katz, et al. Technical terminology: some linguistic properties and an algorithm for identification in text, 1995, Natural Language Engineering.

[3] Hanna M. Wallach, et al. Topic modeling: beyond bag-of-words, 2006, ICML.

[4] Mary Hart, et al. Automatic indexing using selective NLP and first-order thesauri, 1991, RIAO.

[5] Kenneth Ward Church, et al. Using Statistics in Lexical Analysis, 2003, Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon.

[6] Julia E. Hodges, et al. An automated system that assists in the generation of document indexes, 1996, Nat. Lang. Eng.

[7] Tomek Strzalkowski. Natural Language Information Retrieval, 1995, Inf. Process. Manag.

[8] Joel L. Fagan. The effectiveness of a nonsyntactic approach to automatic phrase indexing for document retrieval, 1989.

[9] Hinrich Schütze, et al. Book Reviews: Foundations of Statistical Natural Language Processing, 1999, CL.

[10] David J. C. MacKay, et al. A hierarchical Dirichlet language model, 1995, Natural Language Engineering.

[11] Kenneth Ward Church, et al. Word Association Norms, Mutual Information, and Lexicography, 1989, ACL.

[12] Frank Smadja. Retrieving collocations from text, 1993.

[13] Michael I. Jordan, et al. Latent Dirichlet Allocation, 2001, J. Mach. Learn. Res.

[14] Joel L. Fagan, et al. The effectiveness of a nonsyntactic approach to automatic phrase indexing for document retrieval, 1989, JASIS.

[15] Ted Dunning, et al. Accurate Methods for the Statistics of Surprise and Coincidence, 1993, CL.

[16] Frank Smadja, et al. Retrieving Collocations from Text: Xtract, 1993, CL.

[17] W. Bruce Croft, et al. LDA-based document models for ad-hoc retrieval, 2006, SIGIR.

[18] Claire Cardie, et al. An Analysis of Statistical and Syntactic Phrases, 1997, RIAO.