An Investigation of DNN-Based Speech Synthesis Using Speaker Codes

Recent studies have shown that DNN-based speech synthesis can produce more natural synthesized speech than conventional HMM-based speech synthesis. However, it remains an open problem whether synthesized speech quality can be improved by utilizing a multi-speaker speech corpus. To address this problem, this paper proposes DNN-based speech synthesis using speaker codes as a simple method to improve on the conventional speaker-dependent DNN-based approach. To model speaker variation in the DNN, augmented features (speaker codes) are fed to the hidden layer(s) of the conventional DNN. The proposed method trains the connection weights of the whole DNN using a multi-speaker speech corpus. When synthesizing a speech parameter sequence, a target speaker is chosen from the corpus, and the speaker code corresponding to that target speaker is fed to the DNN to generate the speaker's voice. We investigated the relationship between prediction performance and DNN architecture by changing which hidden layer receives the speaker codes. Experimental results showed that the proposed model outperformed the conventional speaker-dependent DNN when the model architecture was set optimally for the amount of training data available for the selected target speaker.
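The core idea — appending a one-hot speaker code to the input of one hidden layer of an otherwise ordinary feedforward acoustic model — can be sketched as follows. This is a minimal illustrative forward pass, not the paper's implementation; all names, layer sizes, and the choice of ReLU activation are assumptions for the example.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def forward(linguistic_feats, speaker_id, params, n_speakers, inject_layer=1):
    """Sketch of a feedforward acoustic model in which a one-hot speaker
    code is concatenated to the input of one hidden layer (inject_layer).

    params: list of (W, b) weight/bias pairs, one per layer; the layer at
    inject_layer must expect its input dimension enlarged by n_speakers.
    All dimensions here are illustrative, not from the paper.
    """
    code = np.eye(n_speakers)[speaker_id]        # one-hot speaker code
    h = linguistic_feats
    for i, (W, b) in enumerate(params):
        if i == inject_layer:
            h = np.concatenate([h, code])        # feed speaker code here
        h = W @ h + b
        if i < len(params) - 1:                  # linear output layer
            h = relu(h)
    return h                                     # acoustic feature vector
```

Because the speaker code enters mid-network, the layers below it are shared across all speakers in the corpus, while the layers above can specialize the mapping per speaker; at synthesis time, switching the one-hot code switches the generated voice.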
