Optimization and evaluation of Gabor feature sets for ASR

To improve automatic speech recognition performance in adverse conditions, Gabor features motivated by physiological measurements in the primary auditory cortex were optimized and evaluated. In the Aurora 2 experimental setup, these localized spectro-temporal filters, combined with a Tandem system, yield robust performance with a feature set of size 30. Using a Hanning window instead of a cut-off Gaussian envelope improves results further because of its better modulation frequency characteristics. An analysis of the complementarity of Gabor and MFCC features shows that errors could be reduced by 55% with a perfect classifier. In a real-world scenario, combining the two feature types achieves a relative WER reduction of 15% compared to a competitive baseline, indicating the potential of this class of physiologically motivated features.
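The filter construction described above can be illustrated with a minimal sketch: a complex sinusoidal carrier over time and frequency multiplied by a separable envelope, either a Hanning window or a truncated Gaussian. The function name `gabor_filter`, its parameters, the filter sizes, and the Gaussian width (one sixth of the window length) are illustrative assumptions, not the values or the implementation used in the paper.

```python
import numpy as np

def gabor_filter(n_time, n_freq, omega_t, omega_f, envelope="hann"):
    """Build a complex 2-D Gabor filter of shape (n_freq, n_time).

    omega_t and omega_f are the temporal and spectral modulation
    frequencies in radians per frame and per channel (hypothetical units).
    """
    # Centered index grids so the carrier phase is zero at the filter center.
    t = np.arange(n_time) - (n_time - 1) / 2.0
    f = np.arange(n_freq) - (n_freq - 1) / 2.0

    # Complex sinusoidal carrier over the time-frequency plane.
    carrier = np.exp(1j * (omega_f * f[:, None] + omega_t * t[None, :]))

    if envelope == "hann":
        # Drop the zero end points of np.hanning so every tap is nonzero.
        env_t = np.hanning(n_time + 2)[1:-1]
        env_f = np.hanning(n_freq + 2)[1:-1]
    elif envelope == "gauss":
        # Gaussian cut off at the window edges; the width is an assumption.
        env_t = np.exp(-0.5 * (t / (n_time / 6.0)) ** 2)
        env_f = np.exp(-0.5 * (f / (n_freq / 6.0)) ** 2)
    else:
        raise ValueError("envelope must be 'hann' or 'gauss'")

    # Separable envelope times carrier.
    return (env_f[:, None] * env_t[None, :]) * carrier

# Example: one filter per envelope type with the same modulation frequencies.
g_hann = gabor_filter(25, 9, omega_t=0.3, omega_f=0.5, envelope="hann")
g_gauss = gabor_filter(25, 9, omega_t=0.3, omega_f=0.5, envelope="gauss")
```

The Hanning envelope tapers smoothly toward zero at the filter edges, whereas the truncated Gaussian ends abruptly at a nonzero value. Correlating such a filter with a log mel spectrogram and sampling the output at selected channels would give one candidate feature; that step is omitted here.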
