Beyond the Conventional Statistical Language Models: the Variable-length Sequences Approach

In natural language, certain sequences of words occur very frequently. A classical language model such as the n-gram does not account for such sequences adequately, because it underestimates their probabilities. A better approach is to model word sequences as if they were individual dictionary elements: sequences are treated as additional entries of the word lexicon, over which the language models are then computed. In this paper, we present an original method for automatically determining the most important phrases in a corpus. The method is based on information-theoretic criteria, which ensure high statistical consistency, and on French grammatical classes, which capture additional types of linguistic dependencies. In addition, perplexity is used to make the decision to retain a candidate sequence more reliable. We also propose several variants of language models with and without word sequences. Among them, we present a model in which the trigger pairs are linguistically more significant. The originality of this model, compared with commonly used trigger approaches, is that trigger pairs are estimated over word sequences rather than being limited to single words. Experimental tests, in terms of perplexity and recognition rate, are carried out on a vocabulary of 20,000 words and a corpus of 43 million words. The word sequences proposed by our algorithm reduce perplexity by more than 16% compared to models limited to single words, and introducing these word sequences into our dictation machine improves accuracy by approximately 15%.
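The core idea of the abstract, scoring candidate word pairs by an information-theoretic criterion and merging the best ones into compound lexicon entries, can be sketched as follows. This is a minimal illustration only: it uses pointwise mutual information as the scoring criterion and a single greedy merge step, whereas the paper's actual method also exploits French grammatical classes and a perplexity check. All function names here are illustrative assumptions, not the paper's implementation.

```python
from collections import Counter
from math import log2


def score_bigrams(words, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI).

    Pairs whose joint frequency exceeds what independence predicts are
    candidates for merging into a single lexicon entry. This stands in
    for the information-theoretic criterion described in the abstract.
    """
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    n = len(words)
    n_pairs = max(n - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:  # discard rare pairs: unreliable statistics
            continue
        p_pair = count / n_pairs
        p1, p2 = unigrams[w1] / n, unigrams[w2] / n
        scores[(w1, w2)] = log2(p_pair / (p1 * p2))
    return scores


def merge_top_sequence(words, scores):
    """Replace every occurrence of the highest-scoring pair with one token.

    The compound token then behaves as an ordinary dictionary entry when
    the n-gram model is re-estimated over the rewritten corpus.
    """
    if not scores:
        return list(words)
    best = max(scores, key=scores.get)
    merged, i = [], 0
    while i < len(words):
        if i + 1 < len(words) and (words[i], words[i + 1]) == best:
            merged.append(words[i] + "_" + words[i + 1])
            i += 2
        else:
            merged.append(words[i])
            i += 1
    return merged
```

In a full system this merge would be applied iteratively, with each candidate validated by the perplexity of the resulting model before being accepted into the lexicon, as the abstract describes.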
