Segmental HMMs: Modeling Dynamics and Underlying Structure in Speech

Segmental hidden Markov models (SHMMs) were developed to overcome important speech-modeling limitations of conventional HMMs: they represent sequences (or ‘segments’) of features and incorporate the concept of a trajectory to describe how features change over time. This paper presents an overview of investigations into the properties and recognition performance of various SHMMs, highlighting some of the issues that have been identified in using these models successfully. Recognition results show that the best performance was obtained by combining a trajectory model with a formant representation; this combination outperformed both a conventional cepstrum-based HMM system and systems that incorporated either development individually.
