ASR feature extraction with morphologically-filtered power-normalized cochleograms

In this paper we present advances in modeling the masking behavior of the human auditory system to make the feature extraction stage of Automatic Speech Recognition (ASR) more robust. The approach applies a non-linear filter to a spectro-temporal representation, operating simultaneously along frequency and time, by treating the representation as an image and processing it with mathematical morphology operations. A particularly important component of this architecture is the so-called structuring element: we use biologically based considerations to design an element that closely resembles the masking phenomena taking place in the cochlea. The second contribution is the choice of the underlying spectro-temporal representation. The best results were achieved with the representation introduced as part of the Power Normalized Cepstral Coefficients (PNCC), combined with a spectral subtraction step. On the Aurora 2 noisy continuous digits task, we report relative error reductions of 18.7% compared to PNCC and 39.5% compared to MFCC.
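
To make the idea of filtering a spectro-temporal representation "as if it were an image" concrete, the following is a minimal sketch of a flat grey-scale morphological opening (erosion followed by dilation) in plain Python. It is an illustration under stated assumptions, not the filter used in the paper: the function names, the toy spectrogram, and the structuring element `se_offsets` (a small cross with a longer arm toward later frames, loosely mimicking the fact that forward masking lasts longer than backward masking) are all hypothetical.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]
# Structuring element as (d_freq, d_time) offsets; assumed to contain the origin (0, 0).
Offsets = Sequence[Tuple[int, int]]


def erode(spec: Matrix, se: Offsets) -> Matrix:
    """Flat erosion: minimum over the in-bounds cells covered by the element."""
    rows, cols = len(spec), len(spec[0])
    out = [[0.0] * cols for _ in range(rows)]
    for f in range(rows):
        for t in range(cols):
            out[f][t] = min(
                spec[f + df][t + dt]
                for df, dt in se
                if 0 <= f + df < rows and 0 <= t + dt < cols
            )
    return out


def dilate(spec: Matrix, se: Offsets) -> Matrix:
    """Flat dilation: maximum over the in-bounds cells covered by the reflected element."""
    rows, cols = len(spec), len(spec[0])
    out = [[0.0] * cols for _ in range(rows)]
    for f in range(rows):
        for t in range(cols):
            out[f][t] = max(
                spec[f - df][t - dt]
                for df, dt in se
                if 0 <= f - df < rows and 0 <= t - dt < cols
            )
    return out


def opening(spec: Matrix, se: Offsets) -> Matrix:
    """Opening keeps only structures large enough to contain the element."""
    return dilate(erode(spec, se), se)


# Hypothetical element: one channel up and down, two frames forward in time.
se_offsets = [(0, 0), (0, 1), (0, 2), (1, 0), (-1, 0)]

# Toy 5-channel x 6-frame "spectrogram": one element-shaped blob plus an isolated spike.
blob = [[0.0] * 6 for _ in range(5)]
for df, dt in se_offsets:
    blob[2 + df][1 + dt] = 3.0
spec = [row[:] for row in blob]
spec[0][5] = 5.0  # isolated spike, standing in for a noise artifact

opened = opening(spec, se_offsets)
for row in opened:
    print(row)
```

In this toy case the opening removes the isolated spike while leaving the blob that matches the element untouched, which is the qualitative behavior one would want from a masking-inspired filter. A practical system would more likely rely on an optimized library routine and on a structuring element derived from psychoacoustic data.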
