N-channel hidden Markov models for combined stressed speech classification and recognition

Robust speech recognition systems must account for the variations caused by perceptually induced stress if they are to maintain acceptable performance in adverse conditions. One approach is to use front-end stress classification to direct a stress-dependent recognition algorithm that models each speech production domain separately. This study proposes a new approach that combines stress classification and speech recognition in a single algorithm. This is accomplished by generalizing the one-dimensional (1-D) hidden Markov model (HMM) to an N-channel hidden Markov model (N-channel HMM), in which each stressed speech production style under consideration is allocated its own dimension (channel). It is shown that this formulation better integrates perceptually induced stress effects for stress-independent recognition, because the algorithm implicitly performs stress classification at the sub-phoneme (state) level. Compared with a previously established one-channel stress-dependent isolated word recognition system, the proposed N-channel stress-independent HMM method yields a 73.8% reduction in error rate. In addition, an 82.7% reduction in error rate is observed relative to the common one-channel neutral-trained recognition approach.
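To make the idea of state-level stress classification concrete, the following is a minimal Python sketch of Viterbi decoding over a joint (state, channel) space. It is an illustration under stated assumptions, not the formulation used in the study: the abstract does not specify how channels are coupled, so the state-transition matrix shared across channels, the separate channel-switch matrix, the function name `nchannel_viterbi`, and all toy probabilities are hypothetical.

```python
import numpy as np

def nchannel_viterbi(log_emis, log_trans, log_switch, log_init):
    """Viterbi decoding over joint (state, channel) pairs.

    log_emis:   (T, S, N) log-likelihood of frame t in state s, channel n
    log_trans:  (S, S) log state-transition matrix, shared by all channels (assumption)
    log_switch: (N, N) log channel-switch matrix (assumption)
    log_init:   (S, N) log initial probabilities
    Returns (best log score, state path, channel path).
    """
    T, S, N = log_emis.shape
    # Joint log transition (s, n) -> (s', n'), shape (S, N, S, N).
    log_joint = log_trans[:, None, :, None] + log_switch[None, :, None, :]
    delta = log_init + log_emis[0]                # best score ending in (s, n)
    back = np.zeros((T, S, N), dtype=np.int64)    # flat index of best predecessor
    for t in range(1, T):
        cand = delta[:, :, None, None] + log_joint
        flat = cand.reshape(S * N, S, N)          # predecessor index = s * N + n
        back[t] = flat.argmax(axis=0)
        delta = flat.max(axis=0) + log_emis[t]
    # Backtrack from the best final (state, channel) pair.
    s, n = divmod(int(delta.argmax()), N)
    states, channels = [s], [n]
    for t in range(T - 1, 0, -1):
        s, n = divmod(int(back[t, s, n]), N)
        states.append(s)
        channels.append(n)
    states.reverse()
    channels.reverse()
    return float(delta.max()), states, channels

# Toy example (hypothetical numbers): 2 left-to-right states,
# 2 channels (0 = neutral, 1 = stressed), 4 frames.
with np.errstate(divide="ignore"):                # log(0) -> -inf is intended
    log_trans = np.log(np.array([[0.5, 0.5], [0.0, 1.0]]))
    log_switch = np.log(np.array([[0.8, 0.2], [0.2, 0.8]]))
    log_init = np.log(np.array([[0.5, 0.5], [0.0, 0.0]]))
early = np.array([[0.9, 0.1], [0.1, 0.1]])        # favors state 0, neutral
late = np.array([[0.1, 0.1], [0.1, 0.9]])         # favors state 1, stressed
log_emis = np.log(np.stack([early, early, late, late]))

score, states, channels = nchannel_viterbi(log_emis, log_trans, log_switch, log_init)
print(states, channels)
```

In this sketch the decoded channel path plays the role of the implicit per-state stress decision, while the state path is the usual recognition alignment. A word-level recognizer would run one such model per vocabulary word and pick the word with the highest score.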
