Improving native accent identification using deep neural networks

In this paper, we utilize deep neural networks(DNNs) to automatically identify native accents in English and Mandarin when no text, speaker or gender information is available for the speech data. Compared to the Gaussian mixture model(GMM) based conventional methods, the proposed method benefits from two main advantages: first, DNNs are discriminative models which can provide better discrimination on confusion regions of different accents; second, they have the hierarchical nonlinear feature extraction capability which can learn discriminative high-level features for the specified task. In detail, the speech data of all accents is used to train DNNs, and in the testing stage, we first identify the accent label of each frame, then determine the sentence label by the majority voting conducted on the frame labels. The experiments on accented English and Mandarin corpus demonstrate that, compared to the GMM based methods, our proposed method can significantly improve the frame accuracy as well as sentence accuracy on the test set. Moreover, the performance of the proposed method can be further improved by using context information.

[1]  Yang Liu,et al.  Syllable-level prominence detection with acoustic evidence , 2010, INTERSPEECH.

[2]  Tao Chen,et al.  Accent Issues in Large Vocabulary Continuous Speech Recognition , 2004, Int. J. Speech Technol..

[3]  Zhi-Jie Yan,et al.  A scalable approach to using DNN-derived features in GMM-HMM based acoustic modeling for LVCSR , 2013, INTERSPEECH.

[4]  Joseph P. Campbell,et al.  A linguistically-informative approach to dialect recognition using dialect-discriminating context-dependent phonetic models , 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

[5]  Geoffrey Zweig,et al.  An empirical study of automatic accent classification , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[6]  Nitish Srivastava,et al.  Improving neural networks by preventing co-adaptation of feature detectors , 2012, ArXiv.

[7]  Pascale Fung,et al.  Fast accent identification and accented speech recognition , 1999, 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258).

[8]  J. Hansen,et al.  Dialect Classification via Text-Independent Training and Testing for Arabic, Spanish, and Chinese , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[9]  Yee Whye Teh,et al.  A Fast Learning Algorithm for Deep Belief Nets , 2006, Neural Computation.

[10]  Geoffrey E. Hinton,et al.  Rectified Linear Units Improve Restricted Boltzmann Machines , 2010, ICML.

[11]  Biing-Hwang Juang,et al.  Discriminative learning for minimum error classification [pattern recognition] , 1992, IEEE Trans. Signal Process..

[12]  John H. L. Hansen,et al.  Advances in phone-based modeling for automatic accent classification , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[13]  Yoshua Bengio,et al.  Understanding the difficulty of training deep feedforward neural networks , 2010, AISTATS.

[14]  John H. L. Hansen,et al.  Unsupervised Discriminative Training With Application to Dialect Classification , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[15]  Biing-Hwang Juang,et al.  Minimum classification error rate methods for speech recognition , 1997, IEEE Trans. Speech Audio Process..

[16]  Bo Xu,et al.  From English pitch accent detection to Mandarin stress detection, where is the difference? , 2012, Comput. Speech Lang..

[17]  Tara N. Sainath,et al.  Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups , 2012, IEEE Signal Processing Magazine.