An Auditory Inspired Amplitude Modulation Filter Bank for Robust Feature Extraction in Automatic Speech Recognition

The human ability to classify acoustic sounds remains unmatched by current machine learning methods. Psychoacoustic and physiological studies indicate that the mammalian auditory system decomposes audio signals into their acoustic and modulation frequency components prior to further analysis. Since most linguistic information is known to be coded in amplitude fluctuations, mimicking the temporal processing strategies of the auditory system in automatic speech recognition (ASR) promises to increase recognition accuracy. We present an amplitude modulation filter bank (AMFB) that serves as a feature extraction scheme for ASR systems. The time-frequency resolution of the employed FIR filters, i.e., their bandwidth and modulation frequency settings, is adopted from the psychophysically inspired model of Dau (1997), which was originally proposed to describe data from human psychoacoustics. Investigations of modulation phase indicate the need to preserve such information in amplitude modulation features, and we show that filter symmetry has an important impact on ASR performance. The proposed feature extraction scheme yields significant word error rate (WER) reductions on the Aurora-2, Aurora-4, and REVERB ASR tasks compared to other recent feature extraction methods such as MFCC, FDLP, and PNCC features. AMFB features prove highly robust against additive noise, varying transmission channel characteristics, and room reverberation. On the Aurora-4 benchmark, for instance, an average WER of 12.33% with raw and 11.31% with bottleneck-transformed features is attained, corresponding to relative improvements of 19.6% and 29.2% over raw MFCC features, respectively.
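
To make the described processing chain concrete, the sketch below applies a small bank of complex-valued FIR modulation filters to log mel sub-band envelopes and keeps both the real and imaginary filter outputs, so that modulation phase information is retained. This is a minimal illustration only: the helper names (`modulation_filters`, `amfb_features`), the number of filters, their center frequencies, the filter length, and the Hann windowing are assumptions for demonstration and do not reproduce the exact AMFB parameterization of the paper or of Dau's (1997) model.

```python
# Minimal AMFB-style feature extraction sketch (illustrative parameters only).
import numpy as np

def modulation_filters(center_freqs_hz, frame_rate_hz, n_taps=61):
    """Complex-valued FIR band-pass filters: Hann-windowed complex
    exponentials tuned to the given modulation center frequencies."""
    t = (np.arange(n_taps) - (n_taps - 1) / 2) / frame_rate_hz  # seconds
    window = np.hanning(n_taps)
    return [window * np.exp(2j * np.pi * fc * t) for fc in center_freqs_hz]

def amfb_features(log_mel, frame_rate_hz=100.0,
                  center_freqs_hz=(0.0, 2.0, 4.0, 8.0, 16.0)):
    """log_mel: (n_frames, n_bands) log mel spectrogram.
    Returns (n_frames, n_bands * 2 * n_filters) features consisting of the
    real and imaginary parts of each modulation filter output, i.e. the
    modulation phase is preserved rather than discarded."""
    filters = modulation_filters(center_freqs_hz, frame_rate_hz)
    feats = []
    for h in filters:
        # Filter each sub-band envelope along the time axis.
        out = np.stack([np.convolve(log_mel[:, b], h, mode="same")
                        for b in range(log_mel.shape[1])], axis=1)
        feats.extend([out.real, out.imag])
    return np.concatenate(feats, axis=1)

# Usage with random data standing in for a real log mel spectrogram:
dummy = np.random.randn(300, 23)   # 3 s of frames at a 100 Hz frame rate
print(amfb_features(dummy).shape)  # (300, 230)
```

The choice of taking real and imaginary parts instead of only the envelope magnitude follows the abstract's point that modulation phase carries information relevant for ASR; the specific center frequencies above are placeholders.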

[1] Jan Cernocký et al., "Probabilistic and Bottle-Neck Features for LVCSR of Meetings," IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2007.

[2] Sarel van Vuuren et al., "Data-driven design of RASTA-like filters," EUROSPEECH, 1997.

[3] Frédéric E. Theunissen et al., "The Modulation Transfer Function for Speech Intelligibility," PLoS Computational Biology, 2009.

[4] Hermann Ney et al., "Deep hierarchical bottleneck MRASTA features for LVCSR," IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2013.

[5] Steve Renals et al., "WSJCAM0: a British English speech corpus for large vocabulary continuous speech recognition," IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 1995.

[6] Richard M. Stern et al., "Feature extraction for robust speech recognition based on maximizing the sharpness of the power distribution and on power flooring," IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2010.

[7] Sarel van Vuuren et al., "Data based filter design for RASTA-like channel normalization in ASR," Int. Conf. on Spoken Language Processing (ICSLP), 1996.

[8] Daniel Povey et al., "The Kaldi Speech Recognition Toolkit," 2011.

[9] Tim Jürgens et al., "Noise robust distant automatic speech recognition utilizing NMF based source separation and auditory feature extraction," 2013.

[10] Jeff A. Bilmes et al., "MVA Processing of Speech Features," IEEE Transactions on Audio, Speech, and Language Processing, 2007.

[11] Haizhou Li et al., "Normalization of the Speech Modulation Spectra for Robust Speech Recognition," IEEE Transactions on Audio, Speech, and Language Processing, 2008.

[12] R. Plomp et al., "Effect of temporal envelope smearing on speech reception," The Journal of the Acoustical Society of America, 1994.

[13] Hynek Hermansky et al., "Multi-resolution RASTA filtering for TANDEM-based ASR," INTERSPEECH, 2005.

[14] Keith Vertanen, "Baseline WSJ acoustic models for HTK and Sphinx: training recipes and recognition experiments," 2007.

[15] B. Kollmeier et al., "Modeling auditory processing of amplitude modulation. I. Detection and masking with narrow-band carriers," The Journal of the Acoustical Society of America, 1997.

[16] Birger Kollmeier et al., "Estimation of the signal-to-noise ratio with amplitude modulation spectrograms," Speech Communication, 2002.

[17] Tomohiro Nakatani et al., "The REVERB challenge: A common evaluation framework for dereverberation and recognition of reverberant speech," IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2013.

[18] Daniel P. W. Ellis et al., "Frequency-domain linear prediction for temporal features," IEEE Workshop on Automatic Speech Recognition and Understanding, 2003.

[19] R. Plomp et al., "Effect of reducing slow temporal modulations on speech reception," The Journal of the Acoustical Society of America, 1994.

[20] Birger Kollmeier et al., "Amplitude modulation spectrogram based features for robust speech recognition in noisy and reverberant environments," IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2011.

[21] J. Foote et al., "WSJCAM0: a British English speech corpus for large vocabulary continuous speech recognition," 1995.

[22] G. Rose et al., "Sensitivity to amplitude modulated sounds in the anuran auditory nervous system," Journal of Neurophysiology, 1985.

[23] Hynek Hermansky et al., "RASTA processing of speech," IEEE Transactions on Speech and Audio Processing, 1994.

[24] H. Hermansky, "Perceptual linear predictive (PLP) analysis of speech," The Journal of the Acoustical Society of America, 1990.

[25] T. Yin et al., "Responses to amplitude-modulated tones in the auditory nerve of the cat," The Journal of the Acoustical Society of America, 1992.

[26] D. Grantham et al., "Modulation masking: effects of modulation frequency, depth, and phase," The Journal of the Acoustical Society of America, 1989.

[27] Jeih-Weih Hung et al., "Optimization of temporal filters for constructing robust features in speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, 2006.

[28] Michael Kleinschmidt et al., "Localized spectro-temporal features for automatic speech recognition," INTERSPEECH, 2003.

[29] Thomas Hofmann et al., "Greedy Layer-Wise Training of Deep Networks," 2007.

[30] B. Kollmeier et al., "Spectro-temporal modulation subspace-spanning filter bank features for robust automatic speech recognition," The Journal of the Acoustical Society of America, 2012.

[31] Hynek Hermansky et al., "Temporal patterns (TRAPs) in ASR of noisy speech," IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 1999.

[32] Birger Kollmeier et al., "On the use of spectro-temporal features for the IEEE AASP challenge 'Detection and classification of acoustic scenes and events'," IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2013.

[33] C. Schreiner et al., "Gabor analysis of auditory midbrain receptive fields: spectro-temporal and binaural composition," Journal of Neurophysiology, 2003.

[34] Hermann Ney et al., "Context-Dependent MLPs for LVCSR: TANDEM, Hybrid or Both?," INTERSPEECH, 2012.

[35] S. Furui, "Speaker-independent isolated word recognition based on emphasized spectral dynamics," IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 1986.

[36] C. E. Schreiner et al., "Neural processing of amplitude-modulated sounds," Physiological Reviews, 2004.

[37] Daniel P. W. Ellis et al., "Tandem connectionist feature extraction for conventional HMM systems," IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2000.

[38] T. Houtgast, "Frequency selectivity in amplitude-modulation detection," The Journal of the Acoustical Society of America, 1989.

[39] B. Kollmeier et al., "Speech enhancement based on physiological and psychoacoustical models of modulation perception and binaural interaction," The Journal of the Acoustical Society of America, 1994.

[40] C. Schreiner et al., "Periodicity coding in the inferior colliculus of the cat. I. Neuronal mechanisms," Journal of Neurophysiology, 1988.

[41] N. Viemeister, "Temporal modulation transfer functions based upon modulation thresholds," The Journal of the Acoustical Society of America, 1979.

[42] Hynek Hermansky et al., "Temporal envelope compensation for robust phoneme recognition using modulation spectrum," The Journal of the Acoustical Society of America, 2010.

[43] E. Evans, "Place and time coding of frequency in the peripheral auditory system: some physiological pros and cons," Audiology, 1978.

[44] David Pearce et al., "The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions," INTERSPEECH, 2000.

[45] Jean-Marc Boite et al., "Nonlinear discriminant analysis for improved speech recognition," EUROSPEECH, 1997.