HMM-based speech recognition using state-dependent, discriminatively derived transforms on mel-warped DFT features

In this paper, we investigate the interaction between front-end feature extraction and back-end classification in hidden Markov model-based (HMM-based) speech recognition. The proposed model reduces the dimensionality of the mel-warped discrete Fourier transform (DFT) feature space while preserving as much speech classification information as possible. It does so by finding an optimal linear transformation on the mel-warped DFT according to the minimum classification error (MCE) criterion. This linear transformation is trained automatically, jointly with the HMM parameters, using gradient descent to minimize a measure of the overall empirical error count. A further generalization of the model integrates the discriminatively derived state-dependent transformation with the construction of dynamic feature parameters. Experimental results show that state-dependent transformations on mel-warped DFT features outperform mel-frequency cepstral coefficients (MFCCs). An error rate reduction of 15% is obtained on a standard 39-class TIMIT phone classification task, in comparison with a conventional MCE-trained HMM using MFCCs that are not optimized during training.
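As a rough illustration of the MCE idea described above, the following Python sketch applies a linear transform to a feature vector, scores it against classes, and turns the result into a smoothed error count that can be reduced by gradient descent on the transform. It is a simplified stand-in, not the method of the paper: the unit-variance Gaussian class models (in place of HMM states), the finite-difference gradient, and all names (`mce_loss`, `class_scores`, `numerical_grad`, `eta`, `gamma`) and numeric values are assumptions made for this example.

```python
import numpy as np


def mce_loss(scores, label, eta=1.0, gamma=1.0):
    """Smoothed 0/1 loss for one token.

    scores: (K,) class discriminant values (e.g. log-likelihoods).
    label:  index of the correct class.
    """
    scores = np.asarray(scores, dtype=float)
    rivals = np.delete(scores, label)
    # Soft maximum over the rival classes (log-mean-exp scaled by eta),
    # computed relative to the largest rival for numerical stability.
    m = rivals.max()
    soft_max = m + np.log(np.mean(np.exp(eta * (rivals - m)))) / eta
    d = soft_max - scores[label]  # d > 0 indicates a misclassification
    return 1.0 / (1.0 + np.exp(-gamma * d))  # sigmoid gives a soft error count


def class_scores(x, W, means):
    """Discriminants after a linear transform (assumed unit-variance Gaussians).

    x: (D,) input features, W: (d, D) transform, means: (K, d) class means.
    """
    y = W @ x
    return -0.5 * np.sum((means - y) ** 2, axis=1)


def numerical_grad(f, W, eps=1e-5):
    """Central-difference gradient of a scalar function f with respect to W."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        step = np.zeros_like(W)
        step[idx] = eps
        grad[idx] = (f(W + step) - f(W - step)) / (2 * eps)
    return grad


# Toy setup: 4-dimensional input reduced to 2 dimensions, two classes.
W = np.eye(2, 4)
means = np.array([[0.0, 0.0], [3.0, 3.0]])
x = np.array([1.8, 1.6, 0.0, 0.0])  # token of class 0 lying closer to class 1


def loss_of(M):
    return mce_loss(class_scores(x, M, means), label=0)


# One gradient descent step on the transform alone.
W_new = W - 0.05 * numerical_grad(loss_of, W)
print(loss_of(W), loss_of(W_new))
```

In a full system the same loss would be summed over all training tokens, and the class means (or HMM parameters) would be updated together with the transform. A state-dependent variant would keep one `W` per state rather than a single shared matrix.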
