Warping and scaling of the minimum variance distortionless response

Spectral estimation based on the minimum variance distortionless response (MVDR) is well known in the signal processing literature and has been shown to be superior to linear prediction for robust speech recognition. In this work we propose two techniques to improve the resolution and robustness of the MVDR spectral estimate. The first is a time-domain technique for estimating an all-pole model on a warped short-time frequency axis, such as the Mel-frequency scale. The second is a method for scaling the height of the spectral envelope in order to extract robust features for large vocabulary continuous speech recognition systems that must operate in noisy conditions. Moreover, we show that the two techniques can be combined to good effect. In a series of speech recognition experiments on the Switchboard corpus, the combination of the proposed approaches achieved a word error rate (WER) of 35.9%, compared with 37.0% for the conventional MVDR and 37.2% for the widely used Fourier transform.
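For readers unfamiliar with the baseline, the following is a minimal sketch of the conventional (unwarped, unscaled) MVDR spectral estimate, using the textbook definition S(w) = 1 / (v(w)^H R^{-1} v(w)), where R is the Toeplitz autocorrelation matrix and v(w) is the steering vector. It is not the warped or scaled method proposed in this work. The function names (`toeplitz`, `invert`, `mvdr_spectrum`) are hypothetical, and the direct matrix inversion is chosen for clarity rather than speed.

```python
import cmath


def toeplitz(r):
    """Build the symmetric Toeplitz autocorrelation matrix from lags r[0..M]."""
    n = len(r)
    return [[r[abs(i - j)] for j in range(n)] for i in range(n)]


def invert(a):
    """Invert a real square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda i: abs(aug[i][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for i in range(n):
            if i != col:
                f = aug[i][col]
                aug[i] = [x - f * y for x, y in zip(aug[i], aug[col])]
    return [row[n:] for row in aug]


def mvdr_spectrum(r, omegas):
    """Evaluate 1 / (v^H R^{-1} v) at each angular frequency in omegas.

    r      -- autocorrelation lags r[0..M] of a real signal (assumed input)
    omegas -- angular frequencies in radians per sample
    """
    n = len(r)
    r_inv = invert(toeplitz(r))
    spectrum = []
    for w in omegas:
        v = [cmath.exp(1j * w * k) for k in range(n)]
        q = sum(v[i].conjugate() * r_inv[i][j] * v[j]
                for i in range(n) for j in range(n))
        spectrum.append(1.0 / q.real)
    return spectrum


# White noise with unit variance: R is the identity, so the estimate is flat
# and each value is approximately 1/3 for three lags.
print(mvdr_spectrum([1.0, 0.0, 0.0], [0.0, 1.0]))
```

In practice the matrix inverse is avoided: fast algorithms compute the MVDR coefficients from linear prediction coefficients, and a Toeplitz solver would replace `invert` here.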
