Combining speech enhancement and auditory feature extraction for robust speech recognition

Abstract

A major deficiency of state-of-the-art automatic speech recognition (ASR) systems is their lack of robustness to additive and convolutional noise. The model of auditory perception (PEMO), developed by Dau et al. (T. Dau, D. Püschel, A. Kohlrausch, J. Acoust. Soc. Am. 99 (6) (1996) 3615–3622) for psychoacoustical purposes, partly overcomes these difficulties when used as a front end for automatic speech recognition. To further improve the performance of this auditory-based recognition system in background noise, several speech enhancement methods were examined that had been evaluated in earlier studies as components of digital hearing aids. Monaural noise reduction, as proposed by Ephraim and Malah (Y. Ephraim, D. Malah, IEEE Trans. Acoust. Speech Signal Process. ASSP-32 (6) (1984) 1109–1121), was compared to a binaural filter and dereverberation algorithm after Wittkop et al. (T. Wittkop, S. Albani, V. Hohmann, J. Peissig, W. Woods, B. Kollmeier, Acustica United with Acta Acustica 83 (4) (1997) 684–699). Both noise reduction algorithms yield improvements in recognition performance equivalent to an SNR gain of up to 10 dB in non-reverberant conditions for all types of noise, while performance on clean speech is not significantly affected. Even in real-world reverberant conditions, the speech enhancement schemes improve recognition performance by an amount comparable to an SNR gain of up to 5 dB. This effect exceeds expectations, since earlier studies found no increase in speech intelligibility for hearing-impaired human subjects.
