Spectral-temporal receptive fields and MFCC balanced feature extraction for noisy speech recognition

This paper proposes a new set of acoustic features based on spectral-temporal receptive fields (STRFs). The STRF is an analysis method for studying the physiological model of the mammalian auditory system in the spectral-temporal domain. It has two components: the rate (in Hz), which represents the temporal response, and the scale (in cycles/octave), which represents the spectral response. From the obtained STRF, we derive an acoustic feature in three steps. First, the energy of each scale is calculated from the STRF. A logarithm is then applied to the scale energies. Finally, the discrete cosine transform is applied to generate the proposed STRF feature. In our experiments, we combine the proposed STRF feature with conventional Mel-frequency cepstral coefficients (MFCCs) to verify its effectiveness. In a noise-free environment, the proposed feature increases the recognition rate by 17.48%. In noisy environments, the increase in the recognition rate ranges from 5% to 12%.
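The three steps described above (scale energy, logarithm, discrete cosine transform) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the STRF output for one time frame is already available as a nested list indexed by scale, rate, and frequency channel, that the energy of a scale is the sum of squared magnitudes over rates and channels, that the transform is an unnormalized DCT-II, and that the combination with MFCCs is a simple concatenation. The function names and the `eps` floor are hypothetical.

```python
import math


def scale_energies(strf_frame):
    """Energy per scale for one time frame.

    strf_frame[s][r][f] is assumed to hold the STRF response for
    scale s, rate r, and frequency channel f (hypothetical layout).
    """
    return [
        sum(abs(v) ** 2 for rate_row in scale_block for v in rate_row)
        for scale_block in strf_frame
    ]


def dct_ii(x):
    """Unnormalized DCT-II of a sequence (assumed DCT variant)."""
    n = len(x)
    return [
        sum(x[i] * math.cos(math.pi * k * (2 * i + 1) / (2 * n)) for i in range(n))
        for k in range(n)
    ]


def strf_feature(strf_frame, eps=1e-12):
    """Scale energies -> log -> DCT, following the steps in the abstract."""
    energies = scale_energies(strf_frame)
    # eps is an assumed floor that avoids log(0) for silent frames.
    log_energies = [math.log(e + eps) for e in energies]
    return dct_ii(log_energies)


def combine_with_mfcc(strf_feat, mfcc):
    """Assumed combination: append the STRF feature to the MFCC vector."""
    return list(mfcc) + list(strf_feat)


# Toy frame: 2 scales, 2 rates, 2 frequency channels.
frame = [
    [[1.0, 1.0], [1.0, 1.0]],  # scale 0, energy 4
    [[2.0, 0.0], [0.0, 0.0]],  # scale 1, energy 4
]
feature = strf_feature(frame)
combined = combine_with_mfcc(feature, mfcc=[0.0] * 13)
```

In this sketch the feature dimension equals the number of scales, so the combined vector has 13 + 2 entries for the toy frame; an actual system would choose the number of scales and of retained DCT coefficients to suit the recognizer.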
