Multi-speaker Text-to-speech Synthesis Using Deep Gaussian Processes

Multi-speaker speech synthesis is a technique for modeling multiple speakers' voices with a single model. Although many approaches using deep neural networks (DNNs) have been proposed, DNNs are prone to overfitting when the amount of training data is limited. We propose a framework for multi-speaker speech synthesis using deep Gaussian processes (DGPs); a DGP is a deep architecture of Bayesian kernel regressions and is therefore robust to overfitting. In this framework, speaker information is fed to the duration and acoustic models via speaker codes. We also examine the use of deep Gaussian process latent variable models (DGPLVMs), in which the representation of each speaker is learned jointly with the other model parameters, so the similarity and dissimilarity of speakers are taken into account efficiently. To investigate the effectiveness of the proposed methods, we experimentally evaluated two situations: one in which the amount of data from each speaker is balanced (speaker-balanced), and one in which the data from certain speakers are limited (speaker-imbalanced). Subjective and objective evaluation results showed that both the DGP and the DGPLVM synthesize multi-speaker speech more effectively than a DNN in the speaker-balanced situation. We also found that the DGPLVM significantly outperforms the DGP in the speaker-imbalanced situation.

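The paper itself gives no reference implementation; the sketch below is only a minimal NumPy illustration, under assumed feature dimensions, of the two speaker-conditioning schemes mentioned in the abstract: a fixed one-hot speaker code appended to the linguistic features (the speaker-code framework) versus a per-speaker latent vector that would be optimized jointly with the DGP parameters (the DGPLVM variant). All names and dimensions here (NUM_SPEAKERS, LING_DIM, EMBED_DIM, make_model_input) are hypothetical and not taken from the paper.

import numpy as np

# Hypothetical sizes for illustration only.
NUM_SPEAKERS = 10   # number of speakers in the corpus
LING_DIM = 300      # frame-level linguistic feature dimensionality (assumed)
EMBED_DIM = 16      # size of the learned speaker latent in the DGPLVM-style variant

def one_hot_speaker_code(speaker_id, num_speakers=NUM_SPEAKERS):
    """Fixed one-hot speaker code, as in the speaker-code framework."""
    code = np.zeros(num_speakers, dtype=np.float32)
    code[speaker_id] = 1.0
    return code

def make_model_input(linguistic_feats, speaker_vec):
    """Concatenate frame-level linguistic features with a per-speaker vector.

    linguistic_feats: (num_frames, LING_DIM)
    speaker_vec:      (code_dim,), either a one-hot code or a learned latent
    """
    tiled = np.tile(speaker_vec, (linguistic_feats.shape[0], 1))
    return np.concatenate([linguistic_feats, tiled], axis=1)

# Dummy linguistic features for 100 frames.
x_ling = np.random.randn(100, LING_DIM).astype(np.float32)

# Speaker-code variant: the auxiliary input is a fixed one-hot vector.
x_code = make_model_input(x_ling, one_hot_speaker_code(3))

# DGPLVM-style variant: the per-speaker vector is a latent variable that would
# be optimized jointly with the DGP parameters (here it is only initialized).
speaker_latents = np.random.randn(NUM_SPEAKERS, EMBED_DIM).astype(np.float32)
x_lvm = make_model_input(x_ling, speaker_latents[3])

In both cases the duration/acoustic model receives the concatenated input; the only difference is whether the speaker vector is fixed (one-hot code) or treated as a trainable latent representation.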