Robust Log-Energy Estimation and its Dynamic Change Enhancement for In-car Speech Recognition

The log-energy parameter, typically derived from a full-band spectrum, is a critical feature in automatic speech recognition (ASR) systems. However, log-energy is difficult to estimate reliably in the presence of background noise. In this paper, we show theoretically that background noise distorts the trajectories of not only the “conventional” log-energy but also its delta parameters. As a result, the estimated log-energy and its delta parameters deviate from their actual values and no longer describe the speech signal. We therefore propose a new method that estimates log-energy from a sub-band spectrum and then applies dynamic change enhancement and mean smoothing. We demonstrate the effectiveness of the proposed log-energy estimation and its post-processing steps through speech recognition experiments on the in-car CENSREC-2 database. The proposed log-energy, together with its corresponding delta parameters, yields an average improvement of 32.8% over the baseline front-ends. We also show that further improvement can be achieved by incorporating the new Mel-Frequency Cepstral Coefficients (MFCCs) obtained by non-linear spectral contrast stretching.
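The effect described in the abstract, noise flattening both the log-energy trajectory and its deltas, can be illustrated with a small sketch. Everything in it is an assumption made for illustration rather than the paper's method: the toy spectra, the choice of sub-band (bins 2 to 7), the noise placed mostly in the lowest bins, and the helper names `log_energy` and `deltas`. The delta uses the common regression formula over a window of ±2 frames. Dynamic change enhancement and mean smoothing are not sketched, since the abstract does not specify them.

```python
import math


def log_energy(power_spectrum, lo=0, hi=None):
    """Log of the summed power over bins [lo, hi); full band by default."""
    band = power_spectrum[lo:hi]
    return math.log(max(sum(band), 1e-12))  # floor avoids log(0)


def deltas(values, window=2):
    """Regression-based delta over +/- `window` frames, edge frames replicated."""
    n = len(values)
    denom = 2 * sum(k * k for k in range(1, window + 1))
    out = []
    for t in range(n):
        num = 0.0
        for k in range(1, window + 1):
            ahead = values[min(t + k, n - 1)]
            behind = values[max(t - k, 0)]
            num += k * (ahead - behind)
        out.append(num / denom)
    return out


def dynamic_range(values):
    return max(values) - min(values)


# Toy utterance: 8 frames, 8 bins each, speech power rising then falling.
speech_levels = [0.01, 0.1, 1.0, 10.0, 10.0, 1.0, 0.1, 0.01]
speech = [[level] * 8 for level in speech_levels]

# Hypothetical stationary noise, strongest in the two lowest bins.
noise = [2.0, 2.0] + [0.05] * 6
noisy = [[s + n for s, n in zip(frame, noise)] for frame in speech]

clean_full = [log_energy(frame) for frame in speech]
noisy_full = [log_energy(frame) for frame in noisy]
noisy_sub = [log_energy(frame, lo=2) for frame in noisy]  # skip the noisy bins

for name, track in [("clean full-band", clean_full),
                    ("noisy full-band", noisy_full),
                    ("noisy sub-band", noisy_sub)]:
    peak_delta = max(abs(d) for d in deltas(track))
    print(f"{name}: range={dynamic_range(track):.2f}, peak |delta|={peak_delta:.2f}")
```

In this toy setting the noisy full-band track has a much smaller dynamic range and smaller peak delta than the clean track, while the sub-band track recovers part of both. How much is recovered depends entirely on the assumed noise shape and band edges.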
