A unified framework for filterbank and time-frequency basis vectors in ASR frontends

For many years, filterbanks have been widely used as one step of frontend feature extraction for Automatic Speech Recognition (ASR). In this paper, we propose a unified framework for ASR frontends in which the nonlinear amplitude scaling is first moved and the filterbank weights are then combined with the cosine basis vectors. As part of this framework, we show that the delta terms used to encode feature dynamics can likewise be viewed as one realization of a set of “unified” basis vectors over time. With this framework, frontends can be developed, analyzed, and evaluated through their basis vectors over frequency and time.
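The idea of combining filterbank weights with cosine basis vectors can be illustrated with a minimal NumPy sketch. It is not the paper's implementation. It assumes the nonlinear amplitude scaling no longer sits between the filterbank and the cosine transform, so the two linear steps collapse into one matrix. The filter shapes (linearly spaced triangles), the dimensions, the random stand-in spectrogram, and all function names are hypothetical choices made for illustration. The delta weights follow the common regression formula with a half-width of 2 frames, which is also an assumption.

```python
import numpy as np

def triangular_filterbank(n_filters, n_bins):
    """Hypothetical linearly spaced triangular filters, shape (n_filters, n_bins)."""
    centers = np.linspace(0, n_bins - 1, n_filters + 2)
    bins = np.arange(n_bins)
    W = np.zeros((n_filters, n_bins))
    for m in range(n_filters):
        lo, mid, hi = centers[m], centers[m + 1], centers[m + 2]
        rising = (bins - lo) / (mid - lo)
        falling = (hi - bins) / (hi - mid)
        W[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return W

def dct_basis(n_ceps, n_filters):
    """Unnormalized DCT-II cosine basis vectors, shape (n_ceps, n_filters)."""
    k = np.arange(n_ceps)[:, None]
    m = np.arange(n_filters)[None, :]
    return np.cos(np.pi * k * (m + 0.5) / n_filters)

def delta_basis(half_width=2):
    """Regression-style delta weights, viewed as one basis vector over time."""
    n = np.arange(-half_width, half_width + 1)
    return n / np.sum(n ** 2)

def apply_temporal_basis(features, weights):
    """Project each run of consecutive frames onto a temporal basis vector.

    Only frames with a full window on both sides are returned.
    """
    half = len(weights) // 2
    n_frames = features.shape[1]
    return np.stack(
        [features[:, t - half:t + half + 1] @ weights
         for t in range(half, n_frames - half)],
        axis=1,
    )

# Assumed sizes; the spectrogram stands in for already-scaled spectra.
n_bins, n_filters, n_ceps, n_frames = 129, 20, 13, 50
rng = np.random.default_rng(0)
spectrogram = rng.random((n_bins, n_frames))

W = triangular_filterbank(n_filters, n_bins)
C = dct_basis(n_ceps, n_filters)
U = C @ W  # unified basis vectors over frequency, shape (n_ceps, n_bins)

two_step = C @ (W @ spectrogram)  # filterbank followed by cosine transform
one_step = U @ spectrogram        # single projection onto the unified basis
deltas = apply_temporal_basis(one_step, delta_basis(2))
```

Under these assumptions, `one_step` and `two_step` agree up to floating-point error, so each row of `U` can be inspected directly as a basis vector over frequency, and `delta_basis` plays the same role along the time axis.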
