Short-time instantaneous frequency and bandwidth features for speech recognition

In this paper, we investigate the performance of modulation related features and normalized spectral moments for automatic speech recognition. We focus on the short-time averages of the amplitude weighted instantaneous frequencies and bandwidths, computed at each subband of a mel-spaced filterbank. Similar features have been proposed in previous studies, and have been successfully combined with MFCCs for speech and speaker recognition. Our goal is to investigate the stand-alone performance of these features. First, it is experimentally shown that the proposed features are only moderately correlated in the frequency domain, and, unlike MFCCs, they do not require a transformation to the cepstral domain. Next, the filterbank parameters (number of filters and filter overlap) are investigated for the proposed features and compared with those of MFCCs. Results show that frequency related features perform at least as well as MFCCs for clean conditions, and yield superior results for noisy conditions; up to 50% relative error rate reduction for the AURORA3 Spanish task.

[1]  Fred Cummins,et al.  Speaker Identification Using Instantaneous Frequencies , 2008, IEEE Transactions on Audio, Speech, and Language Processing.

[2]  Petros Maragos,et al.  On amplitude and frequency demodulation using energy operators , 1993, IEEE Trans. Signal Process..

[3]  Qi Li,et al.  Recognition of noisy speech using dynamic spectral subband centroids , 2004, IEEE Signal Processing Letters.

[4]  Petros Maragos,et al.  Time-frequency distributions for automatic speech recognition , 2001, IEEE Trans. Speech Audio Process..

[5]  Petros Maragos,et al.  Energy separation in signal modulations with application to speech analysis , 1993, IEEE Trans. Signal Process..

[6]  P. Maragos,et al.  Speech formant frequency and bandwidth tracking using multiband energy demodulation , 1996 .

[7]  Dimitrios Dimitriadis,et al.  Spectral Moment Features Augmented by Low Order Cepstral Coefficients for Robust ASR , 2010, IEEE Signal Processing Letters.

[8]  Douglas A. Reynolds,et al.  Modeling of the glottal flow derivative waveform with application to speaker identification , 1999, IEEE Trans. Speech Audio Process..

[9]  Petros Maragos,et al.  Robust AM-FM features for speech recognition , 2005, IEEE Signal Processing Letters.

[10]  Petros Maragos,et al.  Speech analysis and synthesis using an AM-FM modulation model , 1997, Speech Commun..

[11]  Douglas A. Reynolds,et al.  Measuring fine structure in speech: application to speaker identification , 1995, 1995 International Conference on Acoustics, Speech, and Signal Processing.

[12]  Petros Maragos,et al.  A Comparison of the Squared Energy and Teager-Kaiser Operators for Short-Term Energy Estimation in Additive Noise , 2009, IEEE Transactions on Signal Processing.