Statistical language modeling based on variable-length sequences

Abstract In natural language, and especially in spontaneous speech, people often group words into phrases that become common expressions. This happens for phonological reasons (to ease pronunciation) or for semantic reasons (a block of words is easier to remember once a meaning is assigned to it). Classical language models do not adequately capture such phrases. A better approach is to model certain word sequences as if they were individual dictionary entries: the sequences are added to the vocabulary, and language models are computed over this extended vocabulary. In this paper, we present a method for automatically retrieving the most relevant phrases from a corpus of written sentences. The originality of our approach is that the phrases are extracted from a linguistically tagged corpus, so the resulting phrases are linguistically viable. To measure the contribution of classes to phrase retrieval, we implemented the same algorithm without classes; the class-based method outperformed the class-free one by 11%. Our approach uses information-theoretic criteria that ensure high statistical consistency and make the decision to select a candidate sequence optimal with respect to language perplexity. We propose several language-model variants with and without word sequences, including a model in which the trigger pairs are linguistically more significant. We show that the use of sequences decreases the word error rate and improves the normalized perplexity: the best sequence model improves perplexity by 16% and the accuracy of our dictation system (MAUD) by approximately 14%. Experiments, in terms of perplexity and recognition rate, were carried out on a vocabulary of 20,000 words extracted from a corpus of 43 million words comprising two years of the French newspaper Le Monde. The acoustic model (HMM) is trained on the Bref80 corpus.
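As a rough illustration of the information-theoretic sequence selection the abstract describes, the sketch below scores adjacent word pairs by frequency-weighted pointwise mutual information and merges the highest-scoring pair into a single vocabulary entry. This is a minimal stand-in under stated assumptions: the toy corpus, the frequency weighting, and the underscore-joined token are illustrative choices, not the paper's actual criterion, tagged corpus, or perplexity-based decision rule.

```python
from collections import Counter
import math

def best_merge(corpus):
    """Return the adjacent word pair with the highest frequency-weighted
    pointwise mutual information -- a simple stand-in for an
    information-theoretic sequence-selection criterion."""
    unigrams = Counter(w for sent in corpus for w in sent)
    bigrams = Counter((a, b) for sent in corpus for a, b in zip(sent, sent[1:]))
    n = sum(unigrams.values())

    def score(pair):
        a, b = pair
        # PMI = log P(a,b) / (P(a) P(b)); weighting by the bigram count
        # keeps rare one-off pairs from dominating.
        pmi = math.log((bigrams[pair] / n) / ((unigrams[a] / n) * (unigrams[b] / n)))
        return bigrams[pair] * pmi

    return max(bigrams, key=score)

def merge_pair(corpus, pair):
    """Rewrite the corpus with the chosen pair joined into one
    vocabulary entry, as when a sequence is added to the dictionary."""
    merged = []
    for sent in corpus:
        out, i = [], 0
        while i < len(sent):
            if i + 1 < len(sent) and (sent[i], sent[i + 1]) == pair:
                out.append(sent[i] + "_" + sent[i + 1])
                i += 2
            else:
                out.append(sent[i])
                i += 1
        merged.append(out)
    return merged

corpus = [
    ["the", "new", "york", "times", "reported"],
    ["she", "moved", "to", "new", "york"],
    ["the", "times", "changed"],
]
pair = best_merge(corpus)
print(pair)                        # the pair that co-occurs most reliably
print(merge_pair(corpus, pair)[0])
```

Iterating this merge step (rescoring after each merge) grows multi-word sequences greedily; the paper instead validates each candidate against perplexity and linguistic tags, which this sketch omits.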
