Speaker Adaptation of a Multilingual Acoustic Model for Cross-Language Synthesis

Several studies have shown promising results in adapting DNN-based acoustic models as a mechanism for transferring characteristics from pre-trained models. One such example is speaker adaptation with a small amount of data, where fine-tuning has produced models that generalize well to diverse linguistic contexts not present in the adaptation data. In the present work, our objective is to synthesize speech in different languages in the target speaker's voice, regardless of the language of that speaker's data. To achieve this, we first train a multilingual model on a corpus of recordings from a large number of monolingual speakers and a few bilingual speakers across multiple languages. The model is then adapted using the target speaker's recordings in a language other than the target language. We also examine whether additional adaptation data from a native speaker of the target language improves performance. Subjective evaluation shows that the proposed cross-language speaker adaptation approach can synthesize speech in the target language, in the target speaker's voice, without any data spoken by the target speaker in that language, and that extra data from a native speaker of the target language further improves model performance.
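The two-stage procedure described above (pre-train a multilingual acoustic model on many speakers, then fine-tune it on a small set of the target speaker's recordings) can be sketched as follows. This is a minimal toy illustration only: the linear "acoustic model", the data shapes, the synthetic speaker shift, and all learning rates and step counts are illustrative assumptions, not the paper's actual DNN architecture or training configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def mse_step(W, X, Y, lr):
    """One gradient-descent step on mean-squared error for a linear map Y ~ X @ W."""
    pred = X @ W
    grad = X.T @ (pred - Y) / len(X)
    return W - lr * grad, float(np.mean((pred - Y) ** 2))

# --- Stage 1: "multilingual" pre-training (toy stand-in for the DNN acoustic model).
# X_multi: linguistic features pooled across many speakers/languages;
# Y_multi: the corresponding acoustic features.
X_multi = rng.normal(size=(500, 8))
W_true = rng.normal(size=(8, 4))
Y_multi = X_multi @ W_true + 0.1 * rng.normal(size=(500, 4))

W = np.zeros((8, 4))
for _ in range(200):
    W, pretrain_loss = mse_step(W, X_multi, Y_multi, lr=0.1)

# --- Stage 2: speaker adaptation on a small set in a *different* language.
# The target speaker's voice identity shifts the acoustic targets; we fine-tune
# the pre-trained weights with a smaller learning rate and few steps, so the
# model keeps its multilingual knowledge while moving toward the new voice.
X_adapt = rng.normal(size=(30, 8))
speaker_shift = 0.3 * rng.normal(size=(8, 4))  # synthetic voice-identity offset
Y_adapt = X_adapt @ (W_true + speaker_shift) + 0.1 * rng.normal(size=(30, 4))

W_adapted = W.copy()
losses = []
for _ in range(50):
    W_adapted, loss = mse_step(W_adapted, X_adapt, Y_adapt, lr=0.05)
    losses.append(loss)

print(f"adaptation loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Because adaptation starts from the pre-trained weights rather than from scratch, only a few low-learning-rate steps on the small adaptation set are needed; this mirrors the fine-tuning setup the abstract describes, where the adaptation data need not be in the target language.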
