The Use of Context in Large Vocabulary Speech Recognition

In recent years, considerable progress has been made in the field of continuous speech recognition, where the predominant technology is based on hidden Markov models (HMMs). HMMs represent sequences of time-varying speech spectra using probabilistic functions of an underlying Markov chain. However, because the probability distribution represented by an HMM is very simple, its discriminative ability is limited. As a consequence, a careful choice of the units represented by each model is required in order to accurately model the variation inherent in natural speech. In practice, much of the variation is due to consistent contextual effects and can be accounted for by using context-dependent models. In large vocabulary recognition the use of context-dependent models introduces two major problems. Firstly, some method must be devised to determine the set of contexts which require distinct models. Furthermore, this must be done in a way which takes account of the sparsity and unevenness of the training data. Secondly, a strategy must be devised which allows efficient decoding using models incorporating context dependencies both within words and across word boundaries. This thesis addresses both of these key problems. Firstly, a method of constructing robust and accurate recognisers using decision-tree-based clustering techniques is described. The strength of this approach lies in its ability to accurately model contexts not appearing in the training data. Linguistic knowledge is used, in conjunction with the data, to decide which contexts are similar and can share parameters. A key feature of this approach is that it allows the construction of models which are dependent upon contextual effects occurring across word boundaries. The use of cross-word context-dependent models presents problems for conventional decoders. The second part of the thesis therefore presents a new decoder design which is capable of using these models efficiently.
The decoder is suitable for use with very large vocabularies and long span language models. It is also capable of generating a lattice of word hypotheses with little computational overhead. These lattices can be used to constrain further decoding, allowing efficient use of complex acoustic and language models. The effectiveness of these techniques has been assessed on a variety of large vocabulary continuous speech recognition tasks and results are presented which analyse performance in terms of computational complexity and recognition accuracy. The experiments demonstrate state of the art performance and a recogniser using these techniques was used in the 1994 US …
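
As a rough illustration of how a phonetic decision tree can supply a model for a context that never appeared in the training data, the toy sketch below walks a hand-written tree of yes/no questions about the neighbouring phones and returns the name of a shared (tied) state. The phone classes, the tree itself, the state names and the function `tied_state` are all invented for this example and are not taken from the thesis; in a real system the questions would be chosen from the training data rather than written by hand.

```python
# Hypothetical phone classes used as yes/no questions (illustrative only).
VOWELS = {"aa", "ae", "iy", "uw", "eh"}
NASALS = {"m", "n", "ng"}

# A toy tree for one state of the phone "t".
# Internal node: (side, phone_class, yes_subtree, no_subtree).
# Leaf: the name of a tied state shared by every context that reaches it.
TREE_T = (
    "left", NASALS,
    "t_state_A",
    ("right", VOWELS, "t_state_B", "t_state_C"),
)


def tied_state(tree, left, right):
    """Answer each question about the left or right neighbour until a leaf is reached."""
    while not isinstance(tree, str):
        side, phone_class, yes_branch, no_branch = tree
        neighbour = left if side == "left" else right
        tree = yes_branch if neighbour in phone_class else no_branch
    return tree


# Any (left, right) pair reaches some leaf, so a context absent from the
# training data still maps to a state whose parameters were trained.
print(tied_state(TREE_T, "n", "iy"))
print(tied_state(TREE_T, "s", "k"))
```

Because the questions refer only to the identity of the neighbouring phones, the same lookup applies whether a neighbour lies inside the word or across a word boundary.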
