Mandarin tone modeling using recurrent neural networks

We propose an encoder-classifier framework for modeling Mandarin tones with recurrent neural networks (RNNs). In this framework, the frames of features extracted for tone classification are fed into the RNN and cast into a fixed-dimensional vector (the tone embedding), which a softmax layer then classifies into tone types together with other auxiliary inputs. We investigate several configurations that improve the model, including pooling, feature splicing, and the use of syllable-level tone embeddings. In addition, the tone embeddings and durations of the contextual syllables are exploited to facilitate tone classification. Experimental results on Mandarin tone classification show that the proposed network setups improve tone classification accuracy. The results indicate that the RNN encoder-classifier tone model flexibly accommodates heterogeneous inputs (sequential and segmental) and hence combines the advantages of sequential-classification and segmental-classification tone models.
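The data flow described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the plain tanh recurrence, the mean pooling, the five-class tone inventory, the single duration value used as auxiliary input, the random untrained weights, and all names (`init_params`, `encode`, `classify`) are assumptions made for illustration only.

```python
import numpy as np

NUM_TONES = 5  # assumption: four lexical tones plus the neutral tone


def init_params(feat_dim, hidden_dim, aux_dim, seed=0):
    """Random, untrained weights; a real model would learn these."""
    rng = np.random.default_rng(seed)
    scale = 0.1
    return {
        "W_in": rng.normal(0.0, scale, (hidden_dim, feat_dim)),
        "W_rec": rng.normal(0.0, scale, (hidden_dim, hidden_dim)),
        "b_h": np.zeros(hidden_dim),
        "W_out": rng.normal(0.0, scale, (NUM_TONES, hidden_dim + aux_dim)),
        "b_out": np.zeros(NUM_TONES),
    }


def encode(frames, params):
    """Run a plain tanh RNN over the frames and mean-pool the hidden
    states into one fixed-dimensional vector (the tone embedding)."""
    h = np.zeros(params["b_h"].shape[0])
    states = []
    for x in frames:
        h = np.tanh(params["W_in"] @ x + params["W_rec"] @ h + params["b_h"])
        states.append(h)
    return np.mean(states, axis=0)


def classify(frames, aux, params):
    """Concatenate the tone embedding with segmental auxiliary inputs
    and apply a softmax layer over the tone classes."""
    embedding = encode(frames, params)
    logits = params["W_out"] @ np.concatenate([embedding, aux]) + params["b_out"]
    logits = logits - logits.max()  # subtract the max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()


# Hypothetical syllable: 20 frames of 3-dimensional features,
# plus one auxiliary value standing in for a normalized duration.
rng = np.random.default_rng(1)
frames = rng.normal(size=(20, 3))
aux = np.array([0.2])
params = init_params(feat_dim=3, hidden_dim=8, aux_dim=1)
probs = classify(frames, aux, params)
print(probs.shape, float(probs.sum()))
```

Because the encoder pools over however many frames a syllable has, syllables of different lengths map to embeddings of the same size. This is what allows sequential inputs (frames) and segmental inputs (durations, or embeddings of neighboring syllables) to be concatenated ahead of the softmax layer.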
