A Unified Speaker Adaptation Method for Speech Synthesis using Transcribed and Untranscribed Speech with Backpropagation

By representing speaker characteristics as a single fixed-length vector extracted solely from speech, we can train a neural multi-speaker speech synthesis model by conditioning it on these vectors. Such a model can also be adapted to unseen speakers regardless of whether a transcript of the adaptation data is available. However, this setup restricts the speaker component to a single bias vector, which in turn limits the performance of the adaptation process. In this study, we propose a novel speech synthesis model that can be adapted to unseen speakers by fine-tuning part or all of the network using either transcribed or untranscribed speech. Our methodology essentially consists of two steps: first, we split the conventional acoustic model into a speaker-independent (SI) linguistic encoder and a speaker-adaptive (SA) acoustic decoder; second, we train an auxiliary acoustic encoder that can substitute for the linguistic encoder whenever linguistic features are unobtainable. The results of objective and subjective evaluations show that adaptation with either transcribed or untranscribed speech achieved a reasonable level of performance with an extremely limited amount of data and greatly improved performance with more data. Surprisingly, adaptation with untranscribed speech surpassed its transcribed counterpart in the subjective test, which reveals the limitations of the conventional acoustic model and hints at potential directions for improvement.
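To make the two-step structure concrete, the following is a minimal PyTorch-style sketch, assuming simple feed-forward modules and illustrative feature dimensions; the module names, sizes, and the adapt helper are hypothetical, not the paper's actual implementation. With transcribed speech, adaptation pairs linguistic inputs with acoustic targets through the SI linguistic encoder; with untranscribed speech, the auxiliary acoustic encoder maps the speech itself into the same latent space, so the same SA decoder can be fine-tuned by backpropagation in both cases. For brevity, only the decoder is updated here, as one example of fine-tuning part of the network.

```python
# Hypothetical sketch of the encoder/decoder split described above.
import torch
import torch.nn as nn

class LinguisticEncoder(nn.Module):
    """Speaker-independent (SI) encoder: linguistic features -> latent codes."""
    def __init__(self, n_linguistic=300, n_latent=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_linguistic, 256), nn.ReLU(),
            nn.Linear(256, n_latent))

    def forward(self, x):
        return self.net(x)

class AcousticEncoder(nn.Module):
    """Auxiliary encoder: acoustic features -> the same latent space,
    used in place of the linguistic encoder when no transcript exists."""
    def __init__(self, n_acoustic=80, n_latent=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_acoustic, 256), nn.ReLU(),
            nn.Linear(256, n_latent))

    def forward(self, y):
        return self.net(y)

class AcousticDecoder(nn.Module):
    """Speaker-adaptive (SA) decoder: latent codes -> acoustic features."""
    def __init__(self, n_latent=64, n_acoustic=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_latent, 256), nn.ReLU(),
            nn.Linear(256, n_acoustic))

    def forward(self, z):
        return self.net(z)

def adapt(decoder, encoder, batches, lr=1e-4, steps=100):
    """Fine-tune the SA decoder on adaptation data via backpropagation.
    `encoder` is the linguistic encoder when transcripts are available,
    or the acoustic encoder otherwise; it is kept frozen in this sketch.
    For untranscribed data, `inputs` and `targets` are both the acoustic
    features of the adaptation utterances (an autoencoder-style path)."""
    opt = torch.optim.Adam(decoder.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(steps):
        for inputs, targets in batches:
            with torch.no_grad():  # the encoder is not updated
                z = encoder(inputs)
            loss = loss_fn(decoder(z), targets)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return decoder
```

Because both encoders are trained to produce the same latent codes, swapping one for the other leaves the decoder's input distribution unchanged; this is what lets a single adaptation procedure serve both the transcribed and untranscribed cases.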
