A word graph algorithm for large vocabulary continuous speech recognition

Abstract: This paper describes a method for constructing a word graph (or lattice) for large-vocabulary continuous speech recognition. The advantage of a word graph is that it largely decouples acoustic recognition at the 10-ms level from the final search at the word level, which can then use a complicated language model. The word graph algorithm is obtained as an extension of the one-pass beam search strategy using word-dependent copies of the word models or lexical trees. The method has been tested successfully on the NAB'94 task (American English, continuous speech, 20 000 words, speaker independent) and compared with the integrated method. The experiments show that the word graph density can be reduced to an average of about 10 word hypotheses (i.e. word edges in the graph) per spoken word with virtually no loss in recognition performance.
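To make the notions of word edge, word graph density, and word-level search concrete, the following is a minimal Python sketch. It is not the paper's algorithm: the `WordEdge` layout, the `graph_density` and `best_path` helpers, the toy edges, and the `lm_cost` function are all assumptions introduced for illustration. It only shows how a graph of word hypotheses, once built by the acoustic pass, could be searched separately with a language model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class WordEdge:
    """One word hypothesis in the graph (hypothetical layout)."""
    word: str
    start: int    # first 10-ms frame covered by the hypothesis
    end: int      # frame at which the hypothesis ends (exclusive)
    score: float  # acoustic cost (negative log-likelihood), lower is better


def graph_density(edges, num_spoken_words):
    """Word graph density: word edges per spoken (reference) word."""
    return len(edges) / num_spoken_words


def best_path(edges, num_frames, lm_cost=lambda prev, word: 0.0):
    """Word-level search over the graph with a bigram-style cost.

    Assumes every edge satisfies end > start, so frames can be
    processed in increasing order.
    """
    by_start = defaultdict(list)
    for e in edges:
        by_start[e.start].append(e)
    # best[(t, w)] = (cost, words) of the cheapest path that ends at
    # frame t with last word w; "<s>" marks the sentence start.
    best = {(0, "<s>"): (0.0, [])}
    for t in range(num_frames):
        # Snapshot the states at frame t before adding new ones.
        states = [(k, v) for k, v in best.items() if k[0] == t]
        for (_, prev), (cost, words) in states:
            for e in by_start[t]:
                new_cost = cost + e.score + lm_cost(prev, e.word)
                key = (e.end, e.word)
                if key not in best or new_cost < best[key][0]:
                    best[key] = (new_cost, words + [e.word])
    finals = [v for k, v in best.items() if k[0] == num_frames]
    return min(finals, key=lambda v: v[0]) if finals else None


# Toy graph: two competing hypotheses for each of two spoken words.
edges = [
    WordEdge("the", 0, 3, 1.0),
    WordEdge("a", 0, 3, 1.2),
    WordEdge("cat", 3, 7, 2.0),
    WordEdge("cap", 3, 7, 1.8),
]


def toy_lm_cost(prev, word):
    """Hypothetical language model that penalizes the word 'cap'."""
    return 0.5 if word == "cap" else 0.0
```

In this toy graph the acoustic scores alone favor "the cap", while adding the hypothetical language model cost switches the result to "the cat". The acoustic pass does not have to be repeated for the second search, which is the kind of decoupling the abstract refers to.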
