Amplitude Modulation Filters as Feature Sets for Robust ASR: Constant Absolute or Relative Bandwidth?

Many research efforts in feature extraction for automatic speech recognition focus on analyzing the slow amplitude fluctuations of speech. This study investigates the importance of spectral and temporal resolution in amplitude modulation frequency analysis in order to provide guidance for appropriate filter design. To this end, different wavelet-like and Fourier-transform-like filter time scales are examined, i.e., the importance of time separation is compared with that of frequency separation. The results demonstrate that analyzing three separate amplitude modulation frequency bands of constant absolute bandwidth, covering the range from about 2 to 16 Hz, is sufficient for automatic speech recognition.
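The distinction between constant absolute and constant relative bandwidth can be illustrated with a minimal sketch. It assumes three contiguous bands spanning exactly 2 to 16 Hz; the abstract only says "about 2 to 16 Hz" and does not give band edges, so the edge placement and the function names below are hypothetical illustrations, not the filters used in the study.

```python
def constant_absolute_bands(f_lo, f_hi, n_bands):
    """Split [f_lo, f_hi] into bands of equal width in Hz (Fourier-transform-like)."""
    width = (f_hi - f_lo) / n_bands
    return [(f_lo + i * width, f_lo + (i + 1) * width) for i in range(n_bands)]


def constant_relative_bands(f_lo, f_hi, n_bands):
    """Split [f_lo, f_hi] into bands with an equal upper/lower edge ratio (wavelet-like)."""
    ratio = (f_hi / f_lo) ** (1.0 / n_bands)
    return [(f_lo * ratio ** i, f_lo * ratio ** (i + 1)) for i in range(n_bands)]


# Assumed range and band count, chosen to mirror the abstract's "three bands, about 2 to 16 Hz".
absolute = constant_absolute_bands(2.0, 16.0, 3)
relative = constant_relative_bands(2.0, 16.0, 3)

for name, bands in (("absolute", absolute), ("relative", relative)):
    for lo, hi in bands:
        # Bandwidth in Hz is constant for the first scheme; hi/lo is constant for the second.
        print(f"{name}: {lo:6.2f} to {hi:6.2f} Hz, width {hi - lo:5.2f} Hz, ratio {hi / lo:4.2f}")
```

Under these assumptions, the constant absolute scheme gives every band the same width in Hz (14/3 Hz each), whereas the constant relative scheme gives octave-wide bands whose width in Hz grows with the modulation frequency.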
