Paper information - Context-dependent modeling in a segment-based speech recognition system

Context-dependent modeling in a segment-based speech recognition system

The goal of this thesis is to explore various strategies for incorporating contextual information into a segment-based speech recognition system, while maintaining computational costs at a level acceptable for implementation in a real-time system. The latter is achieved by using context-independent models in the search, while context-dependent models are reserved for re-scoring the hypotheses proposed by the context-independent system. Within this framework, several types of context-dependent sub-word units were evaluated, including word-dependent, biphone, and triphone units. In each case, deleted interpolation was used to compensate for the lack of training data for the models. Other types of context-dependent modeling, such as context-dependent boundary modeling and "offset" modeling, were also used successfully in the re-scoring pass. The evaluation of the system was performed using the Resource Management task. Context-dependent segment models were able to reduce the error rate of the context-independent system by more than twenty percent, and context-dependent boundary models were able to reduce the word error rate by more than a third. A straightforward combination of context-dependent segment models and boundary models leads to further reductions in error rate. So that it can be incorporated easily into existing and future systems, the code for re-sorting N-best lists has been implemented as an object in Sapphire, a framework for specifying the configuration of a speech recognition system using a scripting language. It is currently being tested on Jupiter, a real-time telephone-based weather information system under development here at SLS.
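The two-pass idea in the abstract (a context-independent first pass proposes an N-best list, and context-dependent models re-score and re-sort it) can be illustrated with a minimal Python sketch. It is not the thesis implementation and does not use the Sapphire API. The function names, the toy probability tables, the `"<s>"` start symbol, and the fixed interpolation weight `lam` are all assumptions made for illustration; in deleted interpolation proper, the weight would be estimated on held-out data rather than set by hand.

```python
import math
from typing import Dict, List, Tuple


def interpolated_prob(
    left: str,
    phone: str,
    cd_probs: Dict[Tuple[str, str], float],
    ci_probs: Dict[str, float],
    lam: float,
) -> float:
    """Mix a left-context biphone score with its context-independent score.

    lam is a hypothetical fixed weight; it stands in for a weight that
    deleted interpolation would estimate from held-out data.
    """
    p_ci = ci_probs[phone]
    # Unseen context: fall back to the context-independent score.
    p_cd = cd_probs.get((left, phone), p_ci)
    return lam * p_cd + (1.0 - lam) * p_ci


def rescore_nbest(
    nbest: List[List[str]],
    cd_probs: Dict[Tuple[str, str], float],
    ci_probs: Dict[str, float],
    lam: float = 0.5,
) -> List[Tuple[List[str], float]]:
    """Re-score each hypothesis and return the list sorted best-first."""
    rescored = []
    for phones in nbest:
        score = 0.0
        left = "<s>"  # assumed sentence-start symbol
        for phone in phones:
            score += math.log(
                interpolated_prob(left, phone, cd_probs, ci_probs, lam)
            )
            left = phone
        rescored.append((phones, score))
    # Higher log score is better.
    return sorted(rescored, key=lambda item: item[1], reverse=True)


# Toy example: the first pass ranked ["b", "a"] above ["a", "b"].
ci = {"a": 0.5, "b": 0.5}
cd = {("a", "b"): 0.9, ("b", "a"): 0.1}
ranked = rescore_nbest([["b", "a"], ["a", "b"]], cd, ci, lam=0.5)
```

In this toy run the context-dependent scores favor `["a", "b"]`, so the re-sorted list reverses the first-pass order. Because the expensive context-dependent models touch only the N hypotheses rather than the whole search space, the extra cost stays bounded by the list length.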

Benjamin M. Serridge
