Robust Feature Extraction for Continuous Speech Recognition Using the MVDR Spectrum Estimation Method

This paper describes a robust feature extraction technique for continuous speech recognition. Central to the technique is the minimum variance distortionless response (MVDR) method of spectrum estimation. We consider incorporating perceptual information in two ways: 1) after the MVDR power spectrum is computed and 2) directly during the MVDR spectrum estimation. We show that incorporating perceptual information directly into the spectrum estimation significantly improves both robustness and computational efficiency. We analyze the class separability and speaker variability properties of the features using a Fisher linear discriminant measure and show that these features provide better class separability and better suppression of speaker-dependent information than the widely used mel-frequency cepstral coefficient (MFCC) features. We evaluate the technique on four tasks: an in-car speech recognition task, the Aurora-2 matched task, the Wall Street Journal (WSJ) task, and the Switchboard task. In most cases, the new feature extraction technique gives lower word error rates than the MFCC and perceptual linear prediction (PLP) feature extraction techniques. Statistical significance tests reveal that the improvement is most significant in high-noise conditions. The technique thus provides improved robustness to noise without sacrificing performance in clean conditions.
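
For readers unfamiliar with MVDR (Capon) spectrum estimation, the sketch below illustrates the basic, non-perceptual estimate for a single frame. It is not the implementation evaluated in the paper: the function name `mvdr_spectrum`, the model order, the frequency grid, the biased autocorrelation estimate, and the small diagonal loading term are all assumptions chosen for illustration. The scaling used here is P(w) = 1 / (v(w)^H R^{-1} v(w)), where R is the Toeplitz autocorrelation matrix and v(w) = [1, e^{jw}, ..., e^{jMw}]^T; other texts multiply this by a constant factor.

```python
import numpy as np


def mvdr_spectrum(frame, order=12, n_freqs=129, loading=1e-6):
    """Illustrative MVDR (Capon) power spectrum of one real-valued frame.

    All parameter names and defaults are assumptions made for this sketch.
    Returns the frequency grid (radians/sample, 0..pi) and the power estimate.
    """
    x = np.asarray(frame, dtype=float)
    n = len(x)

    # Biased autocorrelation estimate for lags 0..order.
    r = np.array([np.dot(x[:n - k], x[k:]) / n for k in range(order + 1)])

    # Symmetric Toeplitz autocorrelation matrix: R[i, j] = r[|i - j|].
    idx = np.arange(order + 1)
    R = r[np.abs(idx[:, None] - idx[None, :])]

    # Small diagonal loading keeps the linear solve well conditioned (assumed value).
    R = R + loading * r[0] * np.eye(order + 1)

    # Steering vectors v(w) stored as columns, one per frequency.
    omegas = np.linspace(0.0, np.pi, n_freqs)
    V = np.exp(1j * np.outer(idx, omegas))

    # Quadratic form v^H R^{-1} v for every frequency at once.
    R_inv_V = np.linalg.solve(R, V)
    denom = np.real(np.sum(V.conj() * R_inv_V, axis=0))

    return omegas, 1.0 / denom
```

As a usage example, calling `mvdr_spectrum` on a frame containing a single sinusoid in light noise should give a spectrum whose maximum lies close to the sinusoid's frequency. A cepstral front end would then typically apply a perceptual frequency warping and a cepstral transform to this spectrum; those steps are omitted here.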
