Morphologically Filtered Power-Normalized Cochleograms as Robust, Biologically Inspired Features for ASR

In this paper, we present advances in modeling the masking behavior of the human auditory system (HAS) to enhance the robustness of the feature extraction stage in automatic speech recognition (ASR). The adopted solution is based on nonlinear filtering of a spectro-temporal representation, applied simultaneously to the frequency and time domains - as if it were an image - using mathematical morphology operations. A particularly important component of this architecture is the so-called structuring element (SE), which in the present contribution is designed as a single three-dimensional pattern derived from physiological facts, in such a way that it closely resembles the masking phenomena taking place in the cochlea. A proper choice of spectro-temporal representation lends validity to the model throughout the whole frequency spectrum and intensity range, accommodating the variability of the masking properties of the HAS in these two domains. The best results were achieved with the representation introduced as part of the power-normalized cepstral coefficients (PNCC), together with a spectral subtraction step. This method has been tested on the Aurora 2, Wall Street Journal, and ISOLET databases with both classical hidden Markov model (HMM) and hybrid artificial neural network (ANN)-HMM back-ends. In these, the proposed front-end provides substantial and significant improvements over baseline techniques: up to a 39.5% relative improvement over MFCC and 18.7% over PNCC on the Aurora 2 database.
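The core idea of filtering a spectro-temporal representation "as if it were an image" can be sketched with grey-scale mathematical morphology. The snippet below is a minimal illustration, not the paper's method: it uses a synthetic cochleogram and a flat rectangular structuring element as a stand-in for the physiologically derived three-dimensional SE described in the abstract.

```python
import numpy as np
from scipy.ndimage import grey_opening

# Toy power-normalized spectro-temporal representation:
# rows = frequency channels, cols = time frames (dB-like values).
rng = np.random.default_rng(0)
cochleogram = rng.normal(loc=20.0, scale=5.0, size=(40, 100))

# Hypothetical structuring element (SE): a flat 3x5 window spanning
# neighbouring frequency channels and time frames. The paper's SE is a
# physiologically motivated 3-D pattern; this flat window is a stand-in.
se = np.ones((3, 5), dtype=bool)

# Grey-scale opening (erosion followed by dilation) removes components
# narrower than the SE, loosely mimicking how weak spectro-temporal
# details are masked by stronger neighbouring energy.
filtered = grey_opening(cochleogram, footprint=se)

# Opening is anti-extensive: it never increases any value.
assert np.all(filtered <= cochleogram + 1e-12)
print(filtered.shape)
```

In the actual front-end, a filtering step of this kind would operate on the power-normalized representation before cepstral processing; the SE design, not the opening operator itself, is what encodes the cochlear masking model.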
