Localized spectro-temporal features for automatic speech recognition

Recent results from physiological and psychoacoustic studies indicate that spectrally and temporally localized time-frequency envelope patterns form a relevant basis of auditory perception. This motivates new approaches to feature extraction for automatic speech recognition (ASR) which utilize two-dimensional spectro-temporal modulation filters. This paper provides a motivation for and a brief overview of work on Localized Spectro-Temporal Features (LSTF). It further focuses on the Gabor feature approach, in which a feature selection scheme is applied to automatically obtain a suitable set of Gabor-type features for a given task. The optimized feature sets are examined in ASR experiments with respect to robustness, and their statistical properties are analyzed.

1. Getting auditory ... again?

The question of whether knowledge about the (human) auditory system provides valuable contributions to the design of ASR systems is as old as the field itself. The topic has been discussed extensively elsewhere (e.g. [1]). After all these years, a major argument still holds, namely the large gap in performance between normal-hearing native listeners and state-of-the-art ASR systems: consistently, humans outperform machines by at least an order of magnitude [2]. Human listeners recognize speech even in very adverse acoustic environments with strong reverberation and interfering sound sources. This discrepancy between human and machine performance is not restricted to robustness alone, however. It is observed even in undisturbed conditions and on very small context-independent corpora, where higher-level constraints (cognitive aspects, language models) play no role. Arguably, this hints at insufficient feature extraction in machine recognition systems. It is argued here that including LSTF streams provides another step towards human-like speech recognition.

2. Evidence for (spectro-)temporal processing in the auditory system

Speech is characterized by its fluctuations across time and frequency. The latter reflect the characteristics of the human vocal cords and vocal tract and are commonly exploited in ASR by using short-term spectral representations such as cepstral coefficients. The temporal properties of speech are targeted in ASR by dynamic (delta and delta-delta) features and by temporal filtering and feature extraction techniques such as RASTA [3] and TRAPS [4]. Nevertheless, speech clearly exhibits combined spectro-temporal modulations, due to intonation, coarticulation and the succession of several phonetic elements, e.g., in a syllable. Formant transitions, for example, result in diagonal features in a spectrogram representation of speech. This kind of pattern is captured by LSTF and explicitly targeted by the Gabor feature extraction approach.
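As an illustration of how a two-dimensional Gabor-type modulation filter picks up such diagonal spectro-temporal patterns, the following sketch builds a Gabor kernel (Gaussian envelope times a cosine carrier) and correlates it with a synthetic diagonal ripple standing in for a formant transition. All parameter values and function names here are illustrative assumptions, not the paper's actual feature extraction settings:

```python
import numpy as np

def gabor_kernel(omega_t, omega_f, sigma_t=2.0, sigma_f=2.0, size=11):
    """Gabor-type spectro-temporal kernel: a 2D Gaussian envelope
    multiplied by a cosine carrier with temporal and spectral
    modulation frequencies omega_t, omega_f (cycles per bin)."""
    t = np.arange(size) - size // 2
    T, F = np.meshgrid(t, t, indexing="ij")  # time x frequency grid
    envelope = np.exp(-(T**2 / (2 * sigma_t**2) + F**2 / (2 * sigma_f**2)))
    carrier = np.cos(2 * np.pi * (omega_t * T + omega_f * F))
    kernel = envelope * carrier
    return kernel - kernel.mean()  # zero-mean, so flat regions give no response

def filter_response(spec, kernel):
    """Valid-mode 2D correlation of a (time x frequency) spectrogram
    with the kernel, in plain NumPy."""
    kt, kf = kernel.shape
    nt, nf = spec.shape
    out = np.empty((nt - kt + 1, nf - kf + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(spec[i:i + kt, j:j + kf] * kernel)
    return out

# Synthetic "formant transition": a diagonal ripple in the spectrogram.
i, j = np.meshgrid(np.arange(40), np.arange(40), indexing="ij")
spec = np.cos(2 * np.pi * 0.1 * (i + j))

matched = gabor_kernel(0.1, 0.1)      # oriented along the ripple's diagonal
mismatched = gabor_kernel(0.1, -0.1)  # anti-diagonal orientation
r_matched = np.abs(filter_response(spec, matched)).max()
r_mismatched = np.abs(filter_response(spec, mismatched)).max()
# The matched orientation yields a much larger peak response.
```

Note that setting omega_f = 0 degenerates the kernel to a purely temporal modulation filter (conceptually akin to RASTA- or TRAPS-style processing along one axis), while omega_t = 0 gives a purely spectral one; the two-dimensional case is what lets the filter respond selectively to diagonal structure.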

[1] Harvey Fletcher, et al. Speech and hearing, 1930.

[2] Panu Somervuo, et al. Experiments with linear and nonlinear feature transformations in HMM based phone recognition. Proc. ICASSP, 2003.

[3] J. Tchorz, et al. A model of auditory perception as front end for automatic speech recognition. J. Acoust. Soc. Am., 1999.

[4] S. A. Shamma, et al. Spectro-temporal response field characterization with dynamic ripples in ferret primary auditory cortex. J. Neurophysiol., 2001.

[5] Climent Nadeu, et al. Time and frequency filtering of filter-bank energies for robust HMM speech recognition. Speech Commun., 2000.

[6] C. E. Schreiner, et al. Spectral envelope coding in cat primary auditory cortex: Properties of ripple transfer functions, 1994.

[7] M. Kleinschmidt. Methods for capturing spectro-temporal modulations in automatic speech recognition, 2001.

[8] Richard Lippmann. Speech recognition by machines and humans. Speech Commun., 1997.

[9] K. P. Kording, et al. Learning of sparse auditory receptive fields. Proc. IJCNN, 2002.

[10] Hynek Hermansky. Should recognizers have ears? Speech Commun., 1998.

[11] Steven Greenberg, et al. Speech intelligibility derived from exceedingly sparse spectral information. Proc. ICSLP, 1998.

[12] Steven Greenberg, et al. Robust speech recognition using the modulation spectrogram. Speech Commun., 1998.

[13] B. Kollmeier, et al. Modeling auditory processing of amplitude modulation. II. Spectral and temporal integration. J. Acoust. Soc. Am., 1997.

[14] Misha Pavel, et al. On the relative importance of various components of the modulation spectrum for automatic speech recognition. Speech Commun., 1999.

[15] Hynek Hermansky, et al. Beyond a single critical-band in TRAP based ASR. Proc. INTERSPEECH, 2003.

[16] Daniel P. W. Ellis, et al. Tandem connectionist feature extraction for conventional HMM systems. Proc. ICASSP, 2000.

[17] David Gelbart, et al. Improving word accuracy with Gabor feature extraction. Proc. INTERSPEECH, 2002.

[18] Steven Greenberg, et al. The relation between speech intelligibility and the complex modulation spectrum. Proc. INTERSPEECH, 2001.

[19] Hervé Bourlard, et al. Hybrid HMM/ANN systems for speech recognition: Overview and new research directions. Summer School on Neural Networks, 1997.

[20] T. Dau. Modeling auditory processing of amplitude modulation, 1997.

[21] Hynek Hermansky, et al. TRAPS - classifiers of temporal patterns. Proc. ICSLP, 1998.

[22] Harvey B. Fletcher, et al. Speech and hearing in communication, 1953.

[23] S. Shamma, et al. Spectro-temporal modulation transfer functions and speech intelligibility. J. Acoust. Soc. Am., 1999.

[24] Hynek Hermansky, et al. RASTA processing of speech. IEEE Trans. Speech Audio Process., 1994.

[25] Hans Werner Strube, et al. Recognition of isolated words based on psychoacoustics and neurobiology. Speech Commun., 1990.

[26] T. Gramss. Fast algorithms to find invariant features for a word recognizing neural net, 1991.