Recognition of subsampled speech using a modified Mel filter bank

Several automatic speech recognition engines use Mel Frequency Cepstral Coefficient (MFCC) features internally. Specifically, these features, extracted from speech, are used to build acoustic models in the form of hidden Markov models (HMMs). However, speech features depend on the sampling rate of the speech, and consequently acoustic models built from features extracted at one sampling rate cannot be used by a speech engine to recognize speech sampled at a different rate. In this paper, we first derive a relationship between the MFCC features of re-sampled speech and those of the originally sampled speech, and we propose a modified Mel filter bank so that features extracted at different sampling frequencies are correlated. We show experimentally that acoustic models built with speech sampled at one frequency can be used to recognize sub-sampled speech with high accuracy.
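The core idea can be sketched in code: if the upper edge of the Mel filter bank is fixed at the lower Nyquist frequency (e.g. 4 kHz for 8 kHz speech) rather than at half of each signal's own sampling rate, the filter centre frequencies coincide across sampling rates, so the resulting MFCCs remain comparable. The sketch below is a minimal, assumed implementation of that principle with a standard triangular Mel filter bank; it is not the paper's exact formulation, and all function names and parameter values are illustrative.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(sr, n_fft, n_filters, f_max):
    """Triangular Mel filter bank with a fixed upper edge f_max.

    Capping f_max at the lower Nyquist (instead of sr / 2) keeps the
    filter centre frequencies identical across sampling rates, so the
    MFCCs computed from them stay correlated -- the principle behind
    the modified filter bank (details here are assumptions).
    """
    # Filter edges equally spaced on the Mel scale up to f_max.
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(f_max), n_filters + 2)
    hz = mel_to_hz(mels)
    # Map edge frequencies to FFT bin indices for this sampling rate.
    bins = np.floor((n_fft + 1) * hz / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        lo, mid, hi = bins[i - 1], bins[i], bins[i + 1]
        for k in range(lo, mid):      # rising slope of triangle
            fb[i - 1, k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):      # falling slope of triangle
            fb[i - 1, k] = (hi - k) / max(hi - mid, 1)
    return fb, hz[1:-1]  # filter matrix and centre frequencies in Hz

# Same 4 kHz upper edge for 16 kHz and 8 kHz speech:
# the centre frequencies coincide, only the FFT-bin mapping differs.
fb16, c16 = mel_filter_bank(sr=16000, n_fft=512, n_filters=20, f_max=4000)
fb8, c8 = mel_filter_bank(sr=8000, n_fft=256, n_filters=20, f_max=4000)
```

Because the centre frequencies depend only on `f_max` and `n_filters`, `c16` and `c8` are identical, whereas a conventional filter bank spanning the full band up to each signal's own Nyquist would place the filters at different frequencies for the two sampling rates.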
