Automatic Prosodic Structure Labeling using DNN-BGRU-CRF Hybrid Neural Network

A speech corpus labeled with prosodic structure information is crucial for training text-to-speech (TTS) synthesis models that generate high-quality, natural-sounding synthetic speech. Manual prosodic structure labeling is laborious and time-consuming, and the labels can be inconsistent across annotators. Automatic prosodic labeling is therefore desirable: it speeds up the labeling process and avoids inter-annotator inconsistency. This paper presents a DNN-BGRU-CRF hybrid neural network that combines the advantages of deep neural networks (DNN), bidirectional gated recurrent units (BGRU), and conditional random fields (CRF) to label prosodic structure boundaries at three levels. The model exploits both textual and acoustic cues within a single neural network framework. Experimental results demonstrate the effectiveness of the proposed model.
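
To illustrate the role a CRF output layer can play in such a hybrid tagger, the following is a minimal sketch of Viterbi decoding over per-position label scores. It is not the authors' implementation. The label names (`NB`, `PW`, `PPH`, `IPH`), the function `viterbi_decode`, and the score layout are assumptions made for illustration; in a full model the emission scores would come from the DNN and BGRU layers, which are not modeled here.

```python
from typing import List, Sequence

# Hypothetical label set: no boundary plus three boundary levels.
# These names are assumptions; the abstract does not name the levels.
LABELS = ["NB", "PW", "PPH", "IPH"]


def viterbi_decode(
    emissions: Sequence[Sequence[float]],
    transitions: Sequence[Sequence[float]],
) -> List[str]:
    """Return the highest-scoring label sequence under a linear-chain CRF.

    emissions[t][j]   : score of label j at position t (assumed to be
                        produced upstream by the recurrent layers).
    transitions[i][j] : score of moving from label i to label j.
    """
    if not emissions:
        return []
    n = len(LABELS)
    # Best score of any path ending in each label at the first position.
    score = list(emissions[0])
    backptr = []
    for t in range(1, len(emissions)):
        new_score = []
        ptr = []
        for j in range(n):
            # Best previous label for a path that ends in label j at position t.
            best_i = max(range(n), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
            ptr.append(best_i)
        score = new_score
        backptr.append(ptr)
    # Trace the best path backwards from the best final label.
    best = max(range(n), key=lambda j: score[j])
    path = [best]
    for ptr in reversed(backptr):
        best = ptr[best]
        path.append(best)
    path.reverse()
    return [LABELS[i] for i in path]
```

With all transition scores set to zero, the decoder reduces to a per-position argmax; non-zero transition scores let the output layer discourage implausible label sequences, which is the usual motivation for placing a CRF on top of recurrent layers.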
