Modeling of term-distance and term-occurrence information for improving n-gram language model performance

In this paper, we explore the use of distance and co-occurrence information of word-pairs for language modeling. We extract this information from history-contexts of up to ten words and find that it complements the n-gram model well, since the n-gram model inherently suffers from data scarcity when learning long history-contexts. Evaluated on the WSJ corpus, bigram and trigram model perplexities were reduced by up to 23.5% and 14.0%, respectively. We also show that, compared with the distant bigram, word-pairs are modeled more effectively in terms of both distance and occurrence.
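
To make the two kinds of word-pair information concrete, the following minimal sketch gathers raw statistics from a history window of up to ten words: how often a history word precedes a target word at each distance, and how often the pair co-occurs at any distance in the window. It is only an illustration of the counting step, not the model described in the paper; the function names, the `max_distance` parameter, and the toy sentence are assumptions introduced here.

```python
from collections import Counter


def pair_distance_counts(tokens, max_distance=10):
    """Count (history_word, target_word, distance) triples.

    Hypothetical helper: distance 1 means the history word immediately
    precedes the target word.
    """
    counts = Counter()
    for i, target in enumerate(tokens):
        # Look back over at most max_distance preceding words.
        for d in range(1, min(max_distance, i) + 1):
            counts[(tokens[i - d], target, d)] += 1
    return counts


def cooccurrence_counts(distance_counts):
    """Collapse the distance dimension to get per-pair co-occurrence counts."""
    pairs = Counter()
    for (history, target, _distance), c in distance_counts.items():
        pairs[(history, target)] += c
    return pairs


# Toy example (assumed data, not from the WSJ corpus).
tokens = "the cat sat on the mat".split()
dc = pair_distance_counts(tokens, max_distance=3)
cc = cooccurrence_counts(dc)
```

Keeping the distance in the key lets the same counts serve both purposes: summing over distances gives the occurrence statistic, while normalizing within a pair would give a distribution over distances. How these counts are smoothed and combined with the n-gram model is not specified in the abstract and is left out of the sketch.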
