Minimum Mean-Square Error Estimation of Mel-Frequency Cepstral Features–A Theoretically Consistent Approach

In this work, we consider the problem of feature enhancement for noise-robust automatic speech recognition (ASR). We propose a method for minimum mean-square error (MMSE) estimation of mel-frequency cepstral features, which is based on a minimum number of well-established, theoretically consistent statistical assumptions. More specifically, the method belongs to the class of methods relying on the statistical framework proposed in Ephraim and Malah's original work (“Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator,” IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-32, no. 6, 1984). The method is general in that it allows MMSE estimation of mel-frequency cepstral coefficients (MFCC's), cepstral-mean subtracted (CMS-) MFCC's, autoregressive-moving-average (ARMA)-filtered CMS-MFCC's, velocity, and acceleration coefficients. In addition, the method is easily modified to take into account other compressive non-linearities than the logarithm traditionally used for MFCC computation. In terms of MFCC estimation performance, as measured by MFCC mean-square error, the proposed method shows performance which is identical to or better than other state-of-the-art methods. In terms of ASR performance, no statistical difference could be found between the proposed method and the state-of-the-art methods. We conclude that existing state-of-the-art MFCC feature enhancement algorithms within this class of algorithms, while theoretically suboptimal or based on theoretically inconsistent assumptions, perform close to optimally in the MMSE sense.

[1]  Richard M. Stern,et al.  A vector Taylor series approach for environment-independent speech recognition , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[2]  Stephen Cox,et al.  Some statistical issues in the comparison of speech recognition algorithms , 1989, International Conference on Acoustics, Speech, and Signal Processing,.

[3]  Jon Barker,et al.  Soft decisions in missing data techniques for robust automatic speech recognition , 2000, INTERSPEECH.

[4]  Li Deng,et al.  HMM adaptation using vector taylor series for noisy speech recognition , 2000, INTERSPEECH.

[5]  David Pearce,et al.  The aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions , 2000, INTERSPEECH.

[6]  Pascal Scalart,et al.  Speech enhancement based on a priori signal to noise estimation , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[7]  S. Godsill,et al.  Simple alternatives to the Ephraim and Malah suppression rule for speech enhancement , 2001, Proceedings of the 11th IEEE Signal Processing Workshop on Statistical Signal Processing (Cat. No.01TH8563).

[8]  Jonathan G. Fiscus,et al.  DARPA TIMIT:: acoustic-phonetic continuous speech corpus CD-ROM, NIST speech disc 1-1.1 , 1993 .

[9]  Jeff A. Bilmes,et al.  MVA Processing of Speech Features , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[10]  Zheng-Hua Tan,et al.  Automatic speech recognition on mobile devices and over communication networks , 2008 .

[11]  David Malah,et al.  Speech enhancement using a minimum mean-square error log-spectral amplitude estimator , 1984, IEEE Trans. Acoust. Speech Signal Process..

[12]  Carla Teixeira Lopes,et al.  TIMIT Acoustic-Phonetic Continuous Speech Corpus , 2012 .

[13]  Paul Dalsgaard,et al.  Noise Condition-Dependent Training Based on Noise Classification and SNR Estimation , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[14]  Yifan Gong,et al.  Robust Speech Recognition Using a Cepstral Minimum-Mean-Square-Error-Motivated Noise Suppressor , 2008, IEEE Transactions on Audio, Speech, and Language Processing.

[15]  Jesper Jensen,et al.  Minimum Mean-Square Error Estimation of Discrete Fourier Coefficients With Generalized Gamma Priors , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[16]  Richard M. Stern,et al.  Applying physiologically-motivated models of auditory processing to automatic speech recognition , 2011 .

[17]  Charles W. Therrien,et al.  Discrete Random Signals and Statistical Signal Processing , 1992 .

[18]  Israel Cohen,et al.  Speech enhancement using super-Gaussian speech models and noncausal a priori SNR estimation , 2005, Speech Commun..

[19]  Rainer Martin,et al.  Speech enhancement based on minimum mean-square error estimation and supergaussian priors , 2005, IEEE Transactions on Speech and Audio Processing.

[20]  R.M. Stern,et al.  Missing-feature approaches in speech recognition , 2005, IEEE Signal Processing Magazine.

[21]  Phil D. Green,et al.  Robust automatic speech recognition with missing and unreliable acoustic data , 2001, Speech Commun..

[22]  R. McAulay,et al.  Speech enhancement using a soft-decision noise suppression filter , 1980 .

[23]  John H. L. Hansen,et al.  Constrained iterative speech enhancement with application to speech recognition , 1991, IEEE Trans. Signal Process..

[24]  Jesper Jensen,et al.  DFT-Domain Based Single-Microphone Noise Reduction for Speech Enhancement , 2013, DFT-Domain Based Single-Microphone Noise Reduction for Speech Enhancement.

[25]  Thomas M. Cover,et al.  Elements of Information Theory , 2005 .

[26]  Zheng-Hua Tan,et al.  Automatic Speech Recognition on Mobile Devices and over Communication Networks (Advances in Pattern Recognition) , 2008 .

[27]  W. Bastiaan Kleijn,et al.  Codebook-Based Bayesian Speech Enhancement for Nonstationary Environments , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[28]  H Hermansky,et al.  Perceptual linear predictive (PLP) analysis of speech. , 1990, The Journal of the Acoustical Society of America.

[29]  Yariv Ephraim,et al.  Statistical-model-based speech enhancement systems , 1992, Proc. IEEE.

[30]  Richard C. Hendriks,et al.  Unbiased MMSE-Based Noise Power Estimation With Low Complexity and Low Tracking Delay , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[31]  Rainer Martin,et al.  A novel a priori SNR estimation approach based on selective cepstro-temporal smoothing , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[32]  Yariv Ephraim,et al.  Recent Advancements in Speech Enhancement , 2004 .

[33]  Ephraim Speech enhancement using a minimum mean square error short-time spectral amplitude estimator , 1984 .

[34]  Herman J. M. Steeneken,et al.  Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems , 1993, Speech Commun..

[35]  Kuldip K. Paliwal,et al.  MMSE estimation of log-filterbank energies for robust speech recognition , 2011, Speech Commun..

[36]  Jesper Jensen,et al.  MMSE based noise PSD tracking with low complexity , 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

[37]  John H. L. Hansen,et al.  Compensation of SNR and noise type mismatch using an environmental sniffing based speech recognition solution , 2013, EURASIP J. Audio Speech Music. Process..

[38]  Rainer Martin,et al.  Noise power spectral density estimation based on optimal smoothing and minimum statistics , 2001, IEEE Trans. Speech Audio Process..

[39]  Philipos C. Loizou,et al.  Speech Enhancement: Theory and Practice , 2007 .

[40]  Hynek Hermansky,et al.  RASTA processing of speech , 1994, IEEE Trans. Speech Audio Process..

[41]  Biing-Hwang Juang,et al.  Fundamentals of speech recognition , 1993, Prentice Hall signal processing series.

[42]  Richard J. Povinelli,et al.  Minimum Mean-Squared Error Estimation of Mel-Frequency Cepstral Coefficients Using a Novel Distortion Model , 2008, IEEE Transactions on Audio, Speech, and Language Processing.

[43]  Thippur V. Sreenivas,et al.  GMM based Bayesian approach to speech enhancement in signal / transform domain , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[44]  Susanto Rahardja,et al.  /spl beta/-order MMSE spectral amplitude estimation for speech enhancement , 2005, IEEE Transactions on Speech and Audio Processing.

[45]  Thippur V. Sreenivas,et al.  Codebook constrained Wiener filtering for speech enhancement , 1996, IEEE Trans. Speech Audio Process..