Variable-length sequence language model for large vocabulary continuous dictation machine

In natural language, some word sequences are very frequent. A classical language model, such as an n-gram, does not account for such sequences adequately, because it underestimates their probabilities. A better approach is to model word sequences as if they were individual dictionary elements: the sequences become additional entries in the word lexicon, over which the language models are computed. In this paper, we present two methods for automatically determining frequent phrases in unlabeled corpora of written sentences. Both methods rely on information-theoretic criteria that ensure high statistical consistency, and the resulting models reach a local optimum of the perplexity, which they minimize. The first procedure uses only the n-gram language model to extract word sequences. The second uses a class n-gram model trained on 233 classes derived from the eight grammatical classes of French. Experimental tests, in terms of perplexity and recognition rate, are carried out on a 20,000-word vocabulary and a 43-million-word corpus extracted from the "Le Monde" newspaper. Our models reduce perplexity by more than 20% compared with n-gram (n ≥ 3) and multigram models, and they also outperform both model types in terms of recognition rate.
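The abstract does not spell out the extraction algorithm itself, only the general scheme: repeatedly promote a strongly associated word sequence to a single lexicon entry, then retrain the model on the rewritten corpus. The sketch below is a minimal illustration of that scheme, not the authors' procedure: it greedily merges the most strongly associated adjacent word pair using a count-weighted pointwise mutual information score as a stand-in criterion (the paper selects sequences by perplexity minimization instead), and the function name `find_phrases`, the joining convention, and the thresholds are all hypothetical.

```python
import math
from collections import Counter

def find_phrases(sentences, min_count=5, threshold=10.0, max_merges=50):
    """Greedily merge the most strongly associated adjacent word pair
    into a single lexicon entry, then repeat on the rewritten corpus.
    `sentences` is a list of token lists; merged pairs are joined with '_'.
    The association score is a count-weighted PMI, used here as a simple
    proxy for the information-theoretic criteria described in the paper."""
    sentences = [list(s) for s in sentences]
    for _ in range(max_merges):
        unigrams = Counter(w for s in sentences for w in s)
        bigrams = Counter(p for s in sentences for p in zip(s, s[1:]))
        total = sum(unigrams.values())

        def score(pair):
            # Frequent, strongly associated pairs score highest.
            c_ab = bigrams[pair]
            c_a, c_b = unigrams[pair[0]], unigrams[pair[1]]
            return c_ab * math.log(c_ab * total / (c_a * c_b))

        candidates = [p for p, c in bigrams.items() if c >= min_count]
        if not candidates:
            break
        best = max(candidates, key=score)
        if score(best) < threshold:
            break
        # Rewrite every occurrence of the winning pair as one token.
        merged = "_".join(best)
        for i, s in enumerate(sentences):
            out, j = [], 0
            while j < len(s):
                if j + 1 < len(s) and (s[j], s[j + 1]) == best:
                    out.append(merged)
                    j += 2
                else:
                    out.append(s[j])
                    j += 1
            sentences[i] = out
    return sentences
```

Run on a tokenized corpus, such a procedure would rewrite a frequent French sequence like "il y a" as the single unit "il_y_a" (after two merges), so that a standard n-gram model trained on the rewritten corpus treats the whole phrase as one lexicon entry and assigns it a probability directly rather than as a product of underestimated word-level probabilities.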
