Learning cross-lingual information with multilingual BLSTM for speech synthesis of low-resource languages

Bidirectional long short-term memory (BLSTM) based speech synthesis has shown great potential for improving the quality of synthetic speech. For low-resource languages, however, it is difficult to train a high-quality BLSTM model. BLSTM-based speech synthesis can be viewed as a transformation from input linguistic features to output acoustic features. We assume that the input and output layers of a BLSTM are language-dependent, while the hidden layers can be language-independent if trained properly. We therefore investigate whether abundant training data from another language (the auxiliary language) can benefit BLSTM training for a new language (the target language) that has only limited training data. In this paper, we propose 1) a multilingual BLSTM that shares its hidden layers across languages and 2) a training approach that best utilizes the training data of both the auxiliary and target languages. Experimental results demonstrate the effectiveness of the proposed approach: the multilingual BLSTM learns cross-lingual information and predicts more accurate acoustic features for the target language than a monolingual BLSTM trained on target-language data alone. A subjective test also indicates that the multilingual BLSTM outperforms the monolingual BLSTM in generating higher-quality synthetic speech.
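
To make the shared-hidden-layer architecture concrete, here is a minimal sketch in PyTorch. It is an illustrative assumption, not the paper's implementation: the class name, layer sizes, feature dimensions, and the training schedule noted in the comments are all hypothetical.

```python
# Hypothetical sketch of a multilingual BLSTM: language-dependent input and
# output layers wrap a BLSTM stack whose parameters are shared across
# languages. All dimensions below are illustrative, not from the paper.
import torch
import torch.nn as nn

class MultilingualBLSTM(nn.Module):
    def __init__(self, in_dims, out_dims, hidden=256, layers=2):
        super().__init__()
        # One input projection and one output projection per language.
        self.inputs = nn.ModuleDict(
            {lang: nn.Linear(d, hidden) for lang, d in in_dims.items()})
        self.outputs = nn.ModuleDict(
            {lang: nn.Linear(2 * hidden, d) for lang, d in out_dims.items()})
        # Hidden BLSTM stack shared by all languages (2*hidden outputs,
        # since the forward and backward states are concatenated).
        self.shared = nn.LSTM(hidden, hidden, num_layers=layers,
                              batch_first=True, bidirectional=True)

    def forward(self, x, lang):
        h = torch.tanh(self.inputs[lang](x))   # language-dependent input layer
        h, _ = self.shared(h)                  # shared cross-lingual hidden layers
        return self.outputs[lang](h)           # language-dependent output layer

# One plausible (not necessarily the paper's) schedule: alternate mini-batches
# from the auxiliary and target languages so the shared layers learn
# cross-lingual structure, then fine-tune on the target language.
model = MultilingualBLSTM(in_dims={"aux": 355, "tgt": 297},
                          out_dims={"aux": 127, "tgt": 127})
feats = torch.randn(4, 100, 297)               # (batch, frames, linguistic feats)
acoustic = model(feats, "tgt")                 # predicted acoustic features
```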
