Effectiveness of Histogram Equalization and SyDOCC Features on Speech Recognition Performance on a Real-World Noisy Speech Task

When building automatic speech recognition systems, one often faces the challenge of speech signals that contain noise. This additional noise leads to a drop in recognition performance, especially when the acoustic environment differs between training and testing. Several approaches exist for dealing with noisy data or mismatched conditions. We evaluate two of them: Histogram Equalization (HEQ) and Synchronized Damped Oscillator Cepstral Coefficients (SyDOCCs). HEQ normalizes the statistical properties of the input features in an unsupervised manner without requiring a noise estimate, while SyDOCCs model the acoustic properties of the human ear more closely than Mel-Frequency Cepstral Coefficients (MFCCs). We evaluate both approaches on data with artificially added noise as well as on data containing genuine noise from the recording conditions.
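To make the HEQ idea concrete, the following is a minimal sketch of feature-space histogram equalization, assuming a per-utterance, per-dimension mapping of the empirical feature distribution onto a standard-normal reference; the function and variable names are illustrative and not taken from the paper or any specific toolkit.

```python
import numpy as np
from scipy.stats import norm

def histogram_equalize(features: np.ndarray) -> np.ndarray:
    """Map each feature dimension so its empirical CDF matches a N(0, 1) reference.

    features: array of shape (num_frames, num_dims), e.g. the MFCCs of one utterance.
    """
    num_frames, num_dims = features.shape
    equalized = np.empty_like(features, dtype=float)
    for d in range(num_dims):
        # Rank-based estimate of the empirical CDF, kept strictly inside (0, 1)
        ranks = np.argsort(np.argsort(features[:, d]))
        cdf = (ranks + 0.5) / num_frames
        # Map through the inverse CDF of the Gaussian reference distribution
        equalized[:, d] = norm.ppf(cdf)
    return equalized

# Hypothetical usage:
#   mfccs = extract_mfccs(wav)            # (frames, 13), extractor not shown
#   normalized = histogram_equalize(mfccs)
```

Because the mapping depends only on the ranks of the observed feature values, it requires no noise estimate or labels, which matches the unsupervised character of HEQ described above.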
