An efficient search space representation for large vocabulary continuous speech recognition

Abstract: In pursuit of better performance, current speech recognition systems tend to use increasingly complex models for both the acoustic and the language component. Cross-word context-dependent (CD) phone models and long-span statistical language models (LMs) are now widely used. In this paper, we present a memory-efficient search topology that enables the use of such detailed acoustic and language models in a one-pass time-synchronous recognition system. Our approach is characterized by (1) the decoupling of the two basic knowledge sources, namely pronunciation information and LM information, and (2) the representation of pronunciation information (the lexicon in terms of CD units) by means of a compact static network. The LM information is incorporated into the search at run-time by means of a slightly modified token-passing algorithm. The decoupling of the LM and lexicon allows great flexibility in the choice of LMs, while the static lexicon representation avoids the cost of dynamic tree expansion and facilitates the integration of additional pronunciation information such as assimilation rules. Moreover, the network representation remains compact when words have multiple pronunciations and, by construction, it offers partial LM forwarding at no extra cost.
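The decoupling described in the abstract can be illustrated with a small sketch. It is not the system presented in the paper: it assumes context-independent phones with one state each, a bigram LM stored as a dictionary, no homophones, and no beam pruning, and all names (`build_network`, `decode`, `floor`) are hypothetical. The pronunciation network is built once and never changes; LM scores are added only when a token leaves a word-end node, and tokens are keyed by both network node and LM context, which is one plausible reading of a "slightly modified" token-passing scheme.

```python
import math

def build_network(lexicon):
    """Build a static prefix-tree pronunciation network; node 0 is the root."""
    phones, children, word_at = [None], [{}], [None]
    for word, pron in lexicon.items():
        node = 0
        for p in pron:
            if p not in children[node]:
                phones.append(p)
                children.append({})
                word_at.append(None)
                children[node][p] = len(phones) - 1
            node = children[node][p]
        word_at[node] = word  # toy assumption: no homophones
    return phones, children, word_at

def decode(frames, lexicon, bigram, floor=-10.0):
    """Time-synchronous token passing; returns (log score, word sequence).

    frames: list of dicts mapping phone -> acoustic log-likelihood.
    bigram: dict mapping (previous word, word) -> LM log-probability.
    floor:  log score used for anything missing from either table.
    """
    phones, children, word_at = build_network(lexicon)
    # Tokens are keyed by (network node, LM context), so the LM history
    # lives in the tokens and not in the network itself.
    tokens = {(0, "<s>"): (0.0, ())}
    for frame in frames:
        new = {}

        def relax(node, last, score, hist):
            score += frame.get(phones[node], floor)
            key = (node, last)
            if key not in new or score > new[key][0]:
                new[key] = (score, hist)

        for (node, last), (score, hist) in tokens.items():
            if node != 0:
                relax(node, last, score, hist)       # stay in the same phone
            for child in children[node].values():
                relax(child, last, score, hist)      # move on inside the word
            word = word_at[node]
            if word is not None:                     # word end: apply the LM now
                lm = bigram.get((last, word), floor)
                for child in children[0].values():
                    relax(child, word, score + lm, hist + (word,))
        tokens = new

    # Close the word that each surviving word-end token is sitting in.
    best = (-math.inf, ())
    for (node, last), (score, hist) in tokens.items():
        word = word_at[node]
        if word is not None:
            total = score + bigram.get((last, word), floor)
            if total > best[0]:
                best = (total, hist + (word,))
    return best

# Toy usage with made-up scores.
lexicon = {"at": ["ae", "t"], "ten": ["t", "eh", "n"]}
bigram = {("<s>", "at"): -1.0, ("at", "ten"): -1.0}
frames = [{"ae": -1.0}, {"t": -1.0}, {"t": -1.0}, {"eh": -1.0}, {"n": -1.0}]
score, words = decode(frames, lexicon, bigram)
print(score, words)
```

Because the LM is consulted only through `bigram.get`, swapping in a different LM changes that lookup and the token key, while the network built by `build_network` stays untouched.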
