Improving Mongolian Phrase Break Prediction by Using Syllable and Morphological Embeddings with BiLSTM Model

In speech synthesis systems, phrase break (PB) prediction is the first and most important step. Recent state-of-the-art PB prediction systems rely mainly on word embeddings. However, this approach is not fully applicable to Mongolian, because its word embeddings are inadequately trained owing to the lack of resources. In this paper, we introduce a bidirectional Long Short-Term Memory (BiLSTM) model that combines word embeddings with syllable and morphological embeddings, providing richer, multi-view information that exploits the agglutinative nature of the language. Experimental results show that the proposed method outperforms the compared systems that use only word embeddings. Further analysis shows that it is quite robust to the out-of-vocabulary (OOV) problem, owing to the refined word embeddings. The proposed method achieves state-of-the-art performance in Mongolian PB prediction.
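As a rough illustration of why sub-word views can help with OOV words, the sketch below builds a per-word input vector by concatenating a word embedding with averaged syllable and morpheme embeddings. It is not the model described in this paper: the averaging step, the embedding size `DIM`, the helper names (`make_table`, `mean_vectors`, `compose`), and the placeholder tokens are all assumptions made for the example, and the BiLSTM that would consume these vectors is omitted.

```python
import random

DIM = 4  # hypothetical embedding size, chosen only for this sketch


def make_table(keys, dim=DIM, seed=0):
    """Build a toy embedding table with random vectors (stand-in for trained embeddings)."""
    rng = random.Random(seed)
    return {k: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for k in keys}


def mean_vectors(units, table, dim=DIM):
    """Average the vectors of the known units; return zeros if none are known."""
    vecs = [table[u] for u in units if u in table]
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def compose(word, syllables, morphemes, word_tab, syl_tab, morph_tab, dim=DIM):
    """Concatenate the word, syllable, and morpheme views into one input vector."""
    word_vec = word_tab.get(word, [0.0] * dim)  # zeros for an OOV word
    return (word_vec
            + mean_vectors(syllables, syl_tab, dim)
            + mean_vectors(morphemes, morph_tab, dim))


# Placeholder tokens, not real Mongolian data.
word_tab = make_table(["w1"], seed=1)
syl_tab = make_table(["sy1", "sy2"], seed=2)
morph_tab = make_table(["stem1", "suf1"], seed=3)

known = compose("w1", ["sy1", "sy2"], ["stem1", "suf1"], word_tab, syl_tab, morph_tab)
oov = compose("w2", ["sy1", "sy2"], ["stem1", "suf1"], word_tab, syl_tab, morph_tab)
```

In this toy setup the OOV word `w2` has an all-zero word slice, but its syllable and morpheme slices are identical to those of `w1`, so a downstream sequence model would still receive informative input for it.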
