Improved cepstral mean and variance normalization using Bayesian framework

Cepstral Mean and Variance Normalization (CMVN) is a computationally efficient normalization technique for noise-robust speech recognition. The performance of CMVN is known to degrade for short utterances, for two reasons: there is too little data to estimate the parameters reliably, and discriminative information is lost because every utterance is forced to have zero mean and unit variance. In this work, we propose to use posterior estimates of the mean and variance in CMVN instead of the maximum likelihood estimates. This Bayesian approach provides robust parameter estimates and is also shown to preserve discriminative information without increasing computational cost, making it particularly relevant for Interactive Voice Response (IVR)-based applications. The relative word error rate (WER) reductions of this approach with respect to Cepstral Mean Normalization, CMVN and Histogram Equalization are (i) 40.1%, 27% and 4.3% on the Aurora2 database for all utterances, (ii) 25.7%, 38.6% and 30.4% on the Aurora2 database for short utterances, and (iii) 18.7%, 12.6% and 2.5% on the Aurora4 database.
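The contrast between maximum likelihood and posterior-style estimates can be illustrated with a minimal sketch. The abstract does not give the priors or the exact posterior form, so everything specific here is an assumption: the function names `cmvn_ml` and `cmvn_map`, the prior statistics `prior_mean` and `prior_var` (imagined as global statistics from training data), and the prior weight `tau`, which interpolates between the prior and the per-utterance statistics in the style of relevance-MAP smoothing. It is not the paper's derivation.

```python
import numpy as np


def cmvn_ml(feats, eps=1e-8):
    """Standard CMVN: per-utterance maximum likelihood mean and variance."""
    mean = feats.mean(axis=0)
    var = feats.var(axis=0)
    return (feats - mean) / np.sqrt(var + eps)


def cmvn_map(feats, prior_mean, prior_var, tau=50.0, eps=1e-8):
    """CMVN with shrinkage estimates (illustrative assumption, not the paper's formula).

    feats:      (n_frames, n_dims) cepstral features of one utterance
    prior_mean: (n_dims,) assumed prior mean, e.g. global training mean
    prior_var:  (n_dims,) assumed prior variance, e.g. global training variance
    tau:        hypothetical prior weight, in units of "pseudo-frames"
    """
    n = feats.shape[0]
    ml_mean = feats.mean(axis=0)
    ml_var = feats.var(axis=0)

    # Weight on the utterance statistics: small for short utterances.
    w = n / (n + tau)

    # Posterior-style mean: interpolate between utterance and prior means.
    post_mean = w * ml_mean + (1.0 - w) * prior_mean

    # Posterior-style variance: second moment about post_mean of the
    # n observed frames pooled with tau pseudo-frames from the prior.
    post_var = (
        w * (ml_var + (ml_mean - post_mean) ** 2)
        + (1.0 - w) * (prior_var + (prior_mean - post_mean) ** 2)
    )
    return (feats - post_mean) / np.sqrt(post_var + eps)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    short_utt = rng.normal(loc=3.0, scale=2.0, size=(10, 13))  # 10 frames
    out = cmvn_map(short_utt, np.zeros(13), np.ones(13))
    # Unlike cmvn_ml, the normalized mean is not forced to zero.
    print(out.mean(axis=0)[:3])
```

With `tau=0` the sketch reduces to ordinary CMVN. With `tau > 0`, a short utterance keeps part of its own offset from the prior instead of being forced to exactly zero mean and unit variance, which is the intuition behind preserving discriminative information. The extra cost over CMVN is a few element-wise operations per utterance.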
