Multi-resolution phonetic/segmental features and models for HMM-based speech recognition

This paper explores the modelling of phonetic segments of speech with multi-resolution spectral and temporal correlates. For the spectral representation, a set of multi-resolution cepstral features is proposed. Cepstral features obtained from a DCT of the log energy-spectrum over the full voice bandwidth (100-4000 Hz) are combined with higher-resolution features obtained from DCTs of the lower subband (say 100-2100 Hz) and the upper subband (2100-4000 Hz). This approach can be extended to several levels of resolution. For the representation of the temporal structure of speech segments or phonetic units, the conventional cepstral and dynamic cepstral features, which represent speech at the sub-phonetic level, are supplemented by a set of phonetic features that describe the trajectory of speech over the duration of a phonetic unit. A conditional probability model for phonetic and sub-phonetic features is considered. Experiments demonstrate that the inclusion of the segmental features results in about a 10% decrease in error rates.
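The two-level spectral scheme described above can be illustrated with a minimal sketch. It assumes a single frame's log energy-spectrum sampled on a known frequency grid, and it concatenates DCT coefficients from the full band with those from the two subbands. The function names (`dct_ii`, `multires_cepstrum`), the coefficient counts (`n_full`, `n_sub`), and the unnormalised DCT-II are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def dct_ii(x, n_coeffs):
    """Unnormalised DCT-II of a 1-D array, keeping the first n_coeffs terms."""
    n = len(x)
    k = np.arange(n_coeffs)[:, None]   # coefficient index
    m = np.arange(n)[None, :]          # sample index
    basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n))
    return basis @ x


def multires_cepstrum(log_spectrum, freqs, n_full=8, n_sub=4,
                      lo_hz=100.0, split_hz=2100.0, hi_hz=4000.0):
    """Concatenate full-band and subband cepstra for one frame.

    log_spectrum : log energy-spectrum values for one frame
    freqs        : centre frequency (Hz) of each spectral sample
    n_full, n_sub: hypothetical coefficient counts, chosen for illustration
    """
    full = (freqs >= lo_hz) & (freqs <= hi_hz)
    lower = (freqs >= lo_hz) & (freqs < split_hz)
    upper = (freqs >= split_hz) & (freqs <= hi_hz)
    return np.concatenate([
        dct_ii(log_spectrum[full], n_full),   # coarse view of the whole band
        dct_ii(log_spectrum[lower], n_sub),   # finer view of 100-2100 Hz
        dct_ii(log_spectrum[upper], n_sub),   # finer view of 2100-4000 Hz
    ])


# Usage: a synthetic frame with 129 spectral samples between 0 and 4000 Hz.
freqs = np.linspace(0.0, 4000.0, 129)
rng = np.random.default_rng(0)
log_spectrum = np.log(rng.uniform(0.1, 1.0, size=freqs.shape))
features = multires_cepstrum(log_spectrum, freqs)
print(features.shape)
```

Each subband DCT spans half the bandwidth, so for a given coefficient index its basis functions resolve spectral detail about twice as finely as the full-band DCT. Further levels could be added by splitting each subband again in the same way.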
