A Front-End Technique for Automatic Noisy Speech Recognition

Sounds in real environments rarely occur in isolation: they are complex and usually overlap in time. Auditory masking describes the perceptual interaction between such concurrent sound components. This paper proposes incorporating a model of simultaneous masking into Mel frequency cepstral coefficient (MFCC) extraction to improve the performance of the resulting recognition system. In addition, Gammatone frequency integration is used to warp the energy spectrum; its filters provide gradually decaying weights and compensate for the loss of spectral correlation. Experiments are carried out on the Aurora-2 database, and a DNN-HMM acoustic model is trained with a frame-level cross-entropy criterion. With models trained on multi-condition speech data, the proposed feature extraction method achieves accuracies of up to 98.14% at a signal-to-noise ratio of 10 dB, 94.40% at 5 dB, 81.67% at 0 dB, and 51.5% at −5 dB.
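
The Gammatone frequency integration step can be illustrated with a minimal sketch. It is not the implementation used in the paper: it assumes the common frequency-domain approximation of a fourth-order Gammatone filter, center frequencies spaced uniformly on the ERB-rate scale (Glasberg and Moore), and illustrative settings (8 kHz sampling as in Aurora-2, 23 channels, a 256-point FFT, a 64 Hz lower edge). All function and parameter names are hypothetical.

```python
import numpy as np

def erb(fc_hz):
    """Equivalent rectangular bandwidth in Hz (Glasberg & Moore formula)."""
    return 24.7 * (4.37 * fc_hz / 1000.0 + 1.0)

def erb_rate(f_hz):
    """Map frequency in Hz to the ERB-rate scale."""
    return 21.4 * np.log10(4.37 * f_hz / 1000.0 + 1.0)

def inv_erb_rate(e):
    """Map an ERB-rate value back to frequency in Hz."""
    return (10.0 ** (e / 21.4) - 1.0) * 1000.0 / 4.37

def gammatone_weights(n_filters=23, n_fft=256, sample_rate=8000,
                      f_min=64.0, order=4):
    """Return (weights, centers): power-domain Gammatone weights per FFT bin.

    Assumed approximation: |H(f)|^2 = (1 + ((f - fc) / b)^2) ** (-order),
    with b = 1.019 * ERB(fc). Unlike triangular Mel filters, these weights
    never drop to zero; they decay gradually away from the center frequency.
    """
    freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    # Uniform spacing on the ERB-rate scale, excluding the two band edges.
    points = np.linspace(erb_rate(f_min), erb_rate(sample_rate / 2.0),
                         n_filters + 2)
    centers = inv_erb_rate(points[1:-1])
    bandwidths = 1.019 * erb(centers)
    detune = (freqs[None, :] - centers[:, None]) / bandwidths[:, None]
    weights = (1.0 + detune ** 2) ** (-float(order))
    return weights, centers

def integrate_spectrum(power_spectrum, weights):
    """Warp a (frames, bins) power spectrum into (frames, channels) energies."""
    return power_spectrum @ weights.T

# Usage: integrate the power spectra of five random frames.
rng = np.random.default_rng(0)
frames = rng.standard_normal((5, 256))
power = np.abs(np.fft.rfft(frames, n=256, axis=1)) ** 2
weights, centers = gammatone_weights()
channel_energy = integrate_spectrum(power, weights)
```

Because each channel's weights overlap broadly with those of its neighbors, adjacent channel energies share spectral content, which is one way to read the abstract's remark about compensating for the loss of spectral correlation.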
