Paper information - N-Gram Models

N-Gram Models

Key Points

In automatic speech recognition, n-grams are important for modeling some of the structural usage of natural language: the model uses word dependencies to assign a higher probability to "how are you today" than to "are how today you," although both phrases contain exactly the same words. In information retrieval, simple unigram language models (n-gram models with n = 1), i.e., models that do not use term dependencies, have produced good retrieval quality in many studies. Bigram models (n-gram models with n = 2) would allow the system to model direct term dependencies and to treat the occurrence of "New York" differently from separate occurrences of "New" and "York," possibly improving retrieval performance. Trigram models would allow the system to find direct occurrences of "New York metro," etc. The following equations contain, respectively, (1) a unigram model, (2) a bigram model, and (3) a trigram model:
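
The equations themselves did not survive in this copy of the entry. As a sketch of the standard factorizations, and not necessarily the author's exact notation, the three models can be written as follows. The symbols are assumptions introduced here: T_1, ..., T_m are the terms of a phrase or query, and D is the document (or other context) whose language model is being used.

```latex
\begin{align}
P(T_1, \ldots, T_m \mid D) &= \prod_{i=1}^{m} P(T_i \mid D) \tag{1} \\
P(T_1, \ldots, T_m \mid D) &= P(T_1 \mid D) \prod_{i=2}^{m} P(T_i \mid T_{i-1}, D) \tag{2} \\
P(T_1, \ldots, T_m \mid D) &= P(T_1 \mid D)\, P(T_2 \mid T_1, D) \prod_{i=3}^{m} P(T_i \mid T_{i-2}, T_{i-1}, D) \tag{3}
\end{align}
```

In (1) every term is independent of the others, so word order cannot affect the probability; in (2) and (3) each term is conditioned on the one or two terms immediately preceding it.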

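The word-order example from the speech recognition discussion can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy corpus, the helper names (`train`, `unigram_logprob`, `bigram_logprob`), the `<s>` start marker, and the choice of add-one smoothing, which the entry does not prescribe.

```python
import math
from collections import Counter

# Hypothetical toy corpus, invented for illustration only.
CORPUS = [
    "how are you today",
    "how are you",
    "you are here today",
]


def train(sentences):
    """Count unigrams and bigrams; "<s>" marks the start of a sentence."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def unigram_logprob(phrase, unigrams):
    """Log-probability under a unigram model (n = 1), add-one smoothed."""
    total = sum(c for w, c in unigrams.items() if w != "<s>")
    vocab = len(unigrams) - 1  # training vocabulary, excluding "<s>"
    return sum(
        math.log((unigrams[w] + 1) / (total + vocab)) for w in phrase.split()
    )


def bigram_logprob(phrase, unigrams, bigrams):
    """Log-probability under a bigram model (n = 2), add-one smoothed.

    The count of the previous word is used as the context count, a common
    simplification.
    """
    vocab = len(unigrams) - 1
    words = ["<s>"] + phrase.split()
    return sum(
        math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab))
        for prev, cur in zip(words, words[1:])
    )


unigrams, bigrams = train(CORPUS)
for phrase in ("how are you today", "are how today you"):
    print(
        phrase,
        round(unigram_logprob(phrase, unigrams), 3),
        round(bigram_logprob(phrase, unigrams, bigrams), 3),
    )
```

Under these assumptions the unigram model gives both phrases the same score, because it ignores word order, while the bigram model scores the natural ordering higher than the scrambled one.
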
Djoerd Hiemstra
