Mel frequency cepstral coefficients (MFCC) feature extraction enhancement in the application of speech recognition: A comparison study

Mel Frequency Cepstral Coefficients (MFCCs) are the most widely used features in most speaker and speech recognition applications. Since the 1980s, considerable effort has gone into developing these features. Issues such as the choice of a suitable spectral estimation method, the design of effective filter banks, and the number of features retained all play an important role in the performance and robustness of speech recognition systems. This paper provides an overview of MFCC enhancement techniques applied in speech recognition systems. Details such as accuracy, type of environment, nature of the data, and number of features are investigated and summarized in a table together with the corresponding key references. The benefits and drawbacks of these MFCC enhancement techniques are discussed. This study will hopefully encourage further work on enhancing MFCCs in terms of feature robustness, high accuracy, and lower complexity.
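
To make the three design choices named above concrete, the following is a minimal sketch of a conventional baseline MFCC computation for a single frame, not the method of any particular surveyed paper. It assumes a Hamming-windowed periodogram as the spectral estimate, triangular filters equally spaced on the mel scale (using the common 2595·log10(1 + f/700) formula), and a DCT-II over the log filter-bank energies. The function names, the defaults `num_filters=26` and `num_ceps=13`, and the test signal are illustrative assumptions.

```python
import math


def hz_to_mel(f):
    # Common mel-scale formula (an assumption; other variants exist).
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def power_spectrum(frame):
    """Spectral estimation step: periodogram of a Hamming-windowed frame.

    Uses a naive DFT over bins 0..N/2 to stay dependency-free.
    """
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spec = []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(2.0 * math.pi * k * i / n)
                 for i, x in enumerate(windowed))
        im = -sum(x * math.sin(2.0 * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spec.append((re * re + im * im) / n)
    return spec


def mel_filterbank(num_filters, n_fft, sample_rate):
    """Filter-bank design step: triangular filters spaced evenly in mel."""
    top = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(top * i / (num_filters + 1))
             for i in range(num_filters + 2)]
    bank = []
    for m in range(1, num_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for k in range(n_fft // 2 + 1):
            f = k * sample_rate / n_fft  # centre frequency of DFT bin k
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def mfcc(frame, sample_rate, num_filters=26, num_ceps=13):
    """Return the first num_ceps cepstral coefficients of one frame."""
    spec = power_spectrum(frame)
    bank = mel_filterbank(num_filters, len(frame), sample_rate)
    # Log of each filter's energy; the small floor avoids log(0).
    log_e = [math.log(sum(w * p for w, p in zip(row, spec)) + 1e-10)
             for row in bank]
    # Unnormalized DCT-II; num_ceps sets the number of chosen features.
    return [sum(e * math.cos(math.pi * c * (j + 0.5) / num_filters)
                for j, e in enumerate(log_e))
            for c in range(num_ceps)]


# Usage: one 256-sample frame of a 1 kHz tone sampled at 8 kHz.
sample_rate = 8000
frame = [math.sin(2.0 * math.pi * 1000.0 * i / sample_rate) for i in range(256)]
coeffs = mfcc(frame, sample_rate)
print(len(coeffs))
```

In this sketch, an enhancement technique would replace one stage at a time: `power_spectrum` for a different spectral estimator, `mel_filterbank` for a different filter-bank design, or `num_ceps` for a different feature count.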
