Improving Mandarin Tone Recognition Using Convolutional Bidirectional Long Short-Term Memory with Attention

Automatic tone recognition is useful for Mandarin spoken language processing. However, the complex F0 variations caused by tone co-articulation and by the interplay among tonal effects make tone recognition in continuous Chinese speech difficult. This paper explores the application of Bidirectional Long Short-Term Memory (BLSTM), which is capable of modeling time series, to Mandarin tone recognition in order to handle tone variations in continuous speech. In addition, we introduce an attention mechanism to guide the model toward suitable context information. The experimental results show that the proposed convolutional BLSTM (CNN-BLSTM) with attention mechanism performs best, achieving a tone error rate (TER) of 9.30%, a 17.6% relative error reduction from the DNN baseline system's TER of 11.28%. This suggests that the proposed model handles the complex F0 variations more effectively than the other models evaluated.
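The relative error reduction quoted above can be checked directly from the two TER figures. The short sketch below is only an arithmetic illustration; the function name `relative_error_reduction` is hypothetical and not taken from the paper.

```python
def relative_error_reduction(baseline_ter: float, new_ter: float) -> float:
    """Return the relative reduction in tone error rate, as a percentage."""
    if baseline_ter <= 0:
        raise ValueError("baseline_ter must be positive")
    return (baseline_ter - new_ter) / baseline_ter * 100.0


baseline_ter = 11.28  # DNN baseline TER (%), as reported in the abstract
proposed_ter = 9.30   # CNN-BLSTM with attention TER (%), as reported in the abstract

reduction = relative_error_reduction(baseline_ter, proposed_ter)
print(f"absolute reduction: {baseline_ter - proposed_ter:.2f} points")
print(f"relative reduction: {reduction:.1f}%")
```

The absolute gap is 1.98 percentage points; dividing it by the baseline TER of 11.28% gives the reported relative reduction of about 17.6%.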
