Distant-talking accent recognition by combining GMM and DNN

Automatic accent recognition has recently attracted increasing attention. However, few studies have addressed accent recognition in distant-talking environments, which is important for improving distant-talking speech recognition of non-native accented speech. In this paper, we apply Gaussian mixture models (GMMs) and deep neural networks (DNNs) to identify speaker accents in reverberant environments. We also propose combining the likelihoods produced by the two approaches. In a reverberant environment, the accent recognition rate improved from 90.7% with the GMM to 93.0% with the DNN. The combination of GMM and DNN achieved a recognition rate of 97.5%, outperforming both individual methods because the GMM and DNN are complementary. This corresponds to a relative error reduction of 73.1% compared with the GMM-based method and 64.3% compared with the DNN-based method.
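The relative error reductions quoted in the abstract can be checked from the recognition rates alone. The sketch below assumes the usual definition, in which the error rate is 100% minus the recognition rate and the reduction is measured relative to the baseline error; the function name is illustrative and does not come from the paper.

```python
def relative_error_reduction(baseline_acc: float, improved_acc: float) -> float:
    """Relative error reduction in percent, given two recognition rates in percent.

    Assumes error rate = 100 - recognition rate (an assumption, not stated in the abstract).
    """
    baseline_err = 100.0 - baseline_acc
    improved_err = 100.0 - improved_acc
    return 100.0 * (baseline_err - improved_err) / baseline_err


# Recognition rates reported in the abstract (percent).
gmm_acc, dnn_acc, combined_acc = 90.7, 93.0, 97.5

rer_vs_gmm = relative_error_reduction(gmm_acc, combined_acc)
rer_vs_dnn = relative_error_reduction(dnn_acc, combined_acc)

print(f"Relative error reduction vs GMM: {rer_vs_gmm:.1f}%")
print(f"Relative error reduction vs DNN: {rer_vs_dnn:.1f}%")
```

Under this definition the error rates are 9.3% (GMM), 7.0% (DNN), and 2.5% (combined), which reproduces the reported reductions of 73.1% and 64.3%.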
