Robust feature-estimation and objective quality assessment for noisy speech recognition using the Credit Card corpus

The introduction of acoustic background distortion into speech causes recognition algorithms to fail. To improve the environmental robustness of speech recognition in adverse conditions, a novel constrained-iterative feature-estimation algorithm is considered and shown to produce improved feature characterization in a variety of actual noise conditions. In addition, an objective-measure-based MAP estimator is formulated as a means of predicting changes in robust recognition performance at the speech feature extraction stage. The four measures considered are (i) NIST SNR; (ii) Itakura-Saito log-likelihood; (iii) log-area-ratio; and (iv) weighted-spectral slope. A continuous-distribution, monophone-based hidden Markov model recognition algorithm is used for the objective-measure-based MAP estimator analysis and for recognition evaluation. Evaluations were based on speech data from the Credit Card corpus (CC-DATA). It is shown that feature enhancement provides a consistent level of recognition improvement for broadband and low-frequency colored noise sources. As the stationarity assumption for a given noise source breaks down, the ability of feature enhancement to improve recognition performance decreases. Finally, the log-likelihood-based MAP estimator was found to be the best predictor of recognition performance, while the NIST SNR-based MAP estimator was found to be the poorest predictor across the 27 noise conditions considered.
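To make one of the four objective measures concrete, the following is a minimal sketch of a frame-level log-area-ratio (LAR) distance between a clean and a degraded speech frame. It is an illustration only, not the implementation used in the paper: the abstract does not give the exact formulation, so the function names, the LPC order, the autocorrelation-method analysis, and the root-mean-square form of the distance are all assumptions. The sign convention for reflection coefficients also varies between texts; a consistent sign flip does not change the distance.

```python
import math


def autocorrelation(frame, order):
    """Return autocorrelation lags r[0..order] of one frame (assumed helper)."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(order + 1)]


def reflection_coefficients(r):
    """Levinson-Durbin recursion: autocorrelation lags -> reflection coefficients."""
    if r[0] <= 0.0:
        raise ValueError("frame has no energy; LPC analysis is undefined")
    order = len(r) - 1
    a = [1.0] + [0.0] * order      # prediction-error filter coefficients
    err = r[0]                     # prediction error energy
    ks = []
    for m in range(1, order + 1):
        acc = r[m] + sum(a[i] * r[m - i] for i in range(1, m))
        k = -acc / err
        ks.append(k)
        prev = a[:]
        for i in range(1, m):
            a[i] = prev[i] + k * prev[m - i]
        a[m] = k
        err *= 1.0 - k * k
    return ks


def log_area_ratios(ks):
    """Map reflection coefficients (|k| < 1) to log-area-ratio parameters."""
    return [math.log((1.0 + k) / (1.0 - k)) for k in ks]


def lar_distance(clean_frame, degraded_frame, order=10):
    """RMS difference between the LAR vectors of two frames (assumed form)."""
    lar_c = log_area_ratios(
        reflection_coefficients(autocorrelation(clean_frame, order)))
    lar_d = log_area_ratios(
        reflection_coefficients(autocorrelation(degraded_frame, order)))
    return math.sqrt(sum((c - d) ** 2 for c, d in zip(lar_c, lar_d)) / order)
```

In a quality-assessment setting, a per-frame distance of this kind would typically be averaged over an utterance; identical frames give a distance of zero, and larger values indicate greater spectral-envelope distortion.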
