Context-dependent modeling in a segment-based speech recognition system

The goal of this thesis is to explore various strategies for incorporating contextual information into a segment-based speech recognition system, while keeping computational costs at a level acceptable for implementation in a real-time system. The latter is achieved by using context-independent models in the search, while context-dependent models are reserved for re-scoring the hypotheses proposed by the context-independent system. Within this framework, several types of context-dependent sub-word units were evaluated, including word-dependent, biphone, and triphone units. In each case, deleted interpolation was used to compensate for the lack of training data for the models. Other types of context-dependent modeling, such as context-dependent boundary modeling and "offset" modeling, were also used successfully in the re-scoring pass. The system was evaluated on the Resource Management task. Context-dependent segment models reduced the word error rate of the context-independent system by more than twenty percent, and context-dependent boundary models reduced it by more than a third. A straightforward combination of context-dependent segment models and boundary models leads to further reductions in error rate. So that it can be incorporated easily into existing and future systems, the code for re-sorting N-best lists has been implemented as an object in Sapphire, a framework for specifying the configuration of a speech recognition system using a scripting language. It is currently being tested on Jupiter, a real-time, telephone-based weather information system under development here at SLS.
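The two-pass idea described above can be illustrated with a small sketch. It is not the thesis's Sapphire implementation: the names `Hypothesis`, `biphone_log_score`, and `rescore_nbest`, the fixed interpolation weight `lam`, and all toy probabilities are assumptions made for illustration. In deleted interpolation the weight would normally be estimated on held-out data; here it is simply passed in. Each hypothesis from the context-independent first pass receives an additional biphone score, smoothed by interpolating with the context-independent estimate, and the N-best list is then re-sorted.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class Hypothesis:
    """One entry of an N-best list (hypothetical structure)."""
    words: Tuple[str, ...]
    phones: Tuple[str, ...]
    first_pass_score: float  # log score from the context-independent search


def biphone_log_score(
    phones: Sequence[str],
    cd_probs: Dict[Tuple[str, str], float],
    ci_probs: Dict[str, float],
    lam: float,
) -> float:
    """Sum of log interpolated scores over a phone sequence.

    Each phone is scored as lam * P_cd(phone | left) + (1 - lam) * P_ci(phone).
    When no context-dependent estimate exists for a (left, phone) pair,
    the context-independent value is used for both terms.
    """
    total = 0.0
    left = "<s>"  # assumed sentence-start symbol
    for phone in phones:
        ci = ci_probs[phone]
        cd = cd_probs.get((left, phone), ci)
        total += math.log(lam * cd + (1.0 - lam) * ci)
        left = phone
    return total


def rescore_nbest(
    nbest: Sequence[Hypothesis],
    cd_probs: Dict[Tuple[str, str], float],
    ci_probs: Dict[str, float],
    lam: float = 0.5,
    weight: float = 1.0,
) -> List[Hypothesis]:
    """Re-sort an N-best list by first-pass score plus weighted biphone score."""
    def combined(h: Hypothesis) -> float:
        return h.first_pass_score + weight * biphone_log_score(
            h.phones, cd_probs, ci_probs, lam
        )
    return sorted(nbest, key=combined, reverse=True)


# Toy data (made up): the first pass slightly prefers the first hypothesis.
toy_nbest = [
    Hypothesis(("a",), ("x", "y"), -1.0),
    Hypothesis(("b",), ("x", "z"), -1.2),
]
toy_ci = {"x": 0.5, "y": 0.2, "z": 0.2}
toy_cd = {("x", "y"): 0.1, ("x", "z"): 0.9}

reranked = rescore_nbest(toy_nbest, toy_cd, toy_ci, lam=0.5)
```

Because the expensive context-dependent scoring touches only the N hypotheses that survive the first pass, its cost grows with N rather than with the size of the full search space, which is the property the abstract relies on for real-time use.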
