Improved phoneme recognition by integrating evidence from spectro-temporal and cepstral features

Gabor features have been proposed for extracting spectro-temporal modulation information, and yielding significant improvements in recognition performance. In this paper, we propose the integration of Gabor posteriors with MFCC posteriors, yielding a relative improvement of 14.3% over an MFCC Tandem system. We analyze for different types of acoustic units the complementarity between Gabor features with long-term spectro-temporal modulation information in the mel-spectrogram and MFCC features with short-term temporal information in the cepstral domain. It is found that Gabor features are better for vowel recognition while MFCCs are better for consonants. This explains why their integration offers improvements.

[1]  Hynek Hermansky,et al.  RASTA processing of speech , 1994, IEEE Trans. Speech Audio Process..

[2]  Nelson Morgan,et al.  Multi-stream spectro-temporal features for robust speech recognition , 2008, INTERSPEECH.

[3]  Birger Kollmeier,et al.  Complementarity of MFCC, PLP and Gabor features in the presence of speech-intrinsic variabilities , 2009, INTERSPEECH.

[4]  David Gelbart,et al.  Improving word accuracy with Gabor feature extraction , 2002, INTERSPEECH.

[5]  S A Shamma,et al.  Spectro-temporal response field characterization with dynamic ripples in ferret primary auditory cortex. , 2001, Journal of neurophysiology.

[6]  Lin-Shan Lee,et al.  Data-driven clustered hierarchical tandem system for LVCSR , 2008, INTERSPEECH.

[7]  S. Shamma,et al.  Spectro-temporal modulation transfer functions and speech intelligibility. , 1999, The Journal of the Acoustical Society of America.

[8]  Jonathan Z. Simon,et al.  Robust Spectrotemporal Reverse Correlation for the Auditory System: Optimizing Stimulus Design , 2000, Journal of Computational Neuroscience.

[9]  David Gelbart,et al.  Ensemble Feature Selection for Multi-Stream Automatic Speech Recognition , 2008 .

[10]  Hynek Hermansky,et al.  Multi-resolution RASTA filtering for TANDEM-based ASR , 2005, INTERSPEECH.

[11]  Daniel P. W. Ellis,et al.  Tandem connectionist feature extraction for conventional HMM systems , 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[12]  Nelson Morgan,et al.  Multi-stream to many-stream: using spectro-temporal features for ASR , 2009, INTERSPEECH.

[13]  Hervé Bourlard,et al.  New entropy based combination rules in HMM/ANN multi-stream ASR , 2003, 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03)..