Modulation Spectrum Equalization for Improved Robust Speech Recognition

We propose novel approaches for equalizing the modulation spectrum for robust feature extraction in speech recognition. Common to all approaches is that the temporal trajectories of the feature parameters are first transformed into the magnitude modulation spectrum. In spectral histogram equalization (SHE) and two-band spectral histogram equalization (2B-SHE), we equalize the histogram of the modulation spectrum for each utterance to a reference histogram obtained from clean training data, or perform the equalization separately on two sub-bands of the modulation spectrum. In magnitude ratio equalization (MRE), we define the magnitude ratio of lower to higher modulation frequency components for each utterance and equalize it to a reference value obtained from clean training data. These approaches can be viewed as temporal filters that are adapted to each testing utterance. Experiments performed on the Aurora 2 and 4 corpora for small and large vocabulary tasks indicate that significant performance improvements are achievable for all noise conditions. We also show that additional improvements can be obtained when these approaches are integrated with cepstral mean and variance normalization (CMVN), histogram equalization (HEQ), higher order cepstral moment normalization (HOCMN), or the advanced front-end (AFE). We analyze and discuss the reasons for these improvements from different viewpoints with different sets of data, including adaptive temporal filtering, noise behavior on the modulation spectrum, phoneme types, and modulation spectrum distance measures.
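To make the two main ideas concrete, the following is a minimal sketch in Python/NumPy of per-utterance SHE and MRE applied to one feature trajectory at a time. The function names, the rank-based quantile mapping, the band split at one quarter of the modulation-frequency axis, and the reconstruction from equalized magnitude plus original phase are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of modulation-spectrum equalization for one utterance, assuming
# MFCC trajectories processed dimension by dimension with NumPy.
import numpy as np


def _to_modulation_spectrum(traj):
    """FFT of one feature trajectory -> (magnitude, phase)."""
    spec = np.fft.rfft(traj)
    return np.abs(spec), np.angle(spec)


def _from_modulation_spectrum(mag, phase, n):
    """Rebuild the time trajectory from equalized magnitude and original phase."""
    return np.fft.irfft(mag * np.exp(1j * phase), n=n)


def she_equalize(traj, ref_quantiles):
    """SHE sketch: map the sorted modulation-spectrum magnitudes of this
    utterance onto reference quantiles estimated from clean training data."""
    mag, phase = _to_modulation_spectrum(traj)
    order = np.argsort(mag)
    # Resample the reference quantiles to the current spectrum length.
    target = np.interp(np.linspace(0, 1, mag.size),
                       np.linspace(0, 1, ref_quantiles.size), ref_quantiles)
    eq_mag = np.empty_like(mag)
    eq_mag[order] = target  # rank-based mapping to the reference histogram
    return _from_modulation_spectrum(eq_mag, phase, traj.size)


def mre_equalize(traj, ref_ratio, split=0.25):
    """MRE sketch: scale the higher modulation band so that the ratio
    (low-band magnitude sum) / (high-band magnitude sum) matches a
    reference ratio obtained from clean training data."""
    mag, phase = _to_modulation_spectrum(traj)
    k = max(1, int(split * mag.size))   # low/high boundary (assumed cutoff)
    low, high = mag[:k].sum(), mag[k:].sum()
    ratio = low / max(high, 1e-12)
    mag[k:] *= ratio / ref_ratio        # after scaling, low/high equals ref_ratio
    return _from_modulation_spectrum(mag, phase, traj.size)


# Usage: feats is a (num_frames, num_ceps) MFCC matrix; ref_r holds one
# reference ratio per cepstral dimension (she_equalize is used analogously).
def equalize_utterance(feats, ref_r):
    out = np.empty_like(feats)
    for d in range(feats.shape[1]):
        out[:, d] = mre_equalize(feats[:, d], ref_r[d])
    return out
```

Because the equalized magnitude is recombined with the utterance's own phase, each scheme amounts to an utterance-specific temporal filter on the feature trajectories, which is the interpretation the abstract gives.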
