Dynamic Soft Windowing and Language Dependent Style Token for Code-Switching End-to-End Speech Synthesis

Most current end-to-end speech synthesis systems assume the input text is in a single language. However, code-switching, in which speakers switch between languages within the same utterance, occurs frequently in everyday speech, and building a large mixed-language speech database is difficult and uneconomical. In this paper, both a windowing technique and style token modeling are designed for code-switching end-to-end speech synthesis. To improve the consistency of speaking style in the bilingual setting, a dynamic attention-reweighting soft windowing mechanism is proposed to ensure smooth transitions at code switches, in contrast to conventional windowing techniques that impose fixed constraints. To compensate for the shortage of mixed-language training data, a language-dependent style token is designed for cross-language multi-speaker acoustic modeling, with both Mandarin and English monolingual data serving as an extended training set. Attention gating is proposed to adjust the style token dynamically based on the language and the attended context information. Experimental results show that the proposed methods improve intelligibility, naturalness, and similarity.
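
To illustrate the soft-windowing idea, below is a minimal PyTorch sketch of attention reweighting with a soft window. It assumes a Gaussian window centered on the previous attention centroid, followed by renormalization; the function name, the Gaussian parameterization, and the width value are illustrative assumptions, not the paper's exact formulation.

import torch

def soft_window_reweight(alignments, prev_centroid, width=3.0):
    """Reweight raw attention weights with a soft (Gaussian) window.

    alignments:    (batch, enc_len) raw attention weights from the scorer
    prev_centroid: (batch,) expected encoder position at the previous decoder
                   step, i.e. sum_j j * alpha_{t-1,j}
    width:         assumed window width; larger values relax the constraint
    """
    _, enc_len = alignments.shape
    positions = torch.arange(enc_len, dtype=alignments.dtype,
                             device=alignments.device).unsqueeze(0)  # (1, enc_len)
    # Soft window: penalize positions far from the previous centroid, but never
    # hard-zero them (as a fixed window would), so the alignment can still move
    # freely across a language switch.
    window = torch.exp(-((positions - prev_centroid.unsqueeze(1)) ** 2)
                       / (2.0 * width ** 2))                         # (batch, enc_len)
    reweighted = alignments * window
    # Renormalize so the reweighted weights still form a distribution.
    return reweighted / reweighted.sum(dim=1, keepdim=True).clamp_min(1e-8)

# Usage inside one decoder step: reweight, then track the new centroid.
alphas = torch.softmax(torch.randn(2, 50), dim=1)      # stand-in raw weights
centroid = (alphas * torch.arange(50.0)).sum(dim=1)    # expected position
alphas_windowed = soft_window_reweight(alphas, centroid)

The key design point in this sketch is that the window is multiplicative and never exactly zero, which is what makes it "soft" relative to fixed hard-window constraints.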

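Similarly, the following is a minimal sketch of a language-dependent style token layer with attention gating, under the assumption of one learnable token bank per language and a sigmoid gate conditioned on the language identity and the attended context. The module name, token counts, dimensions, and the gating form are assumptions for illustration only.

import torch
import torch.nn as nn

class LanguageStyleTokens(nn.Module):
    def __init__(self, num_tokens=10, token_dim=256, context_dim=256, num_langs=2):
        super().__init__()
        # Assumed: one bank of learnable style tokens per language
        # (e.g. index 0 = Mandarin, index 1 = English).
        self.tokens = nn.Parameter(torch.randn(num_langs, num_tokens, token_dim) * 0.3)
        self.query_proj = nn.Linear(context_dim, token_dim)
        # Gate conditioned on the language identity and the attended context,
        # scaling the style embedding at each decoder step.
        self.gate = nn.Linear(context_dim + num_langs, token_dim)

    def forward(self, context, lang_id):
        """context: (batch, context_dim) attended encoder context
           lang_id: (batch,) long tensor of language indices"""
        bank = self.tokens[lang_id]                        # (batch, num_tokens, token_dim)
        query = self.query_proj(context).unsqueeze(1)      # (batch, 1, token_dim)
        # Scaled dot-product attention over the language's token bank.
        scores = torch.softmax(
            (query * bank).sum(-1) / bank.shape[-1] ** 0.5, dim=-1)   # (batch, num_tokens)
        style = (scores.unsqueeze(-1) * bank).sum(dim=1)   # (batch, token_dim)
        # Attention gating: sigmoid gate from language one-hot plus context.
        lang_onehot = nn.functional.one_hot(lang_id, num_classes=self.tokens.shape[0]).float()
        g = torch.sigmoid(self.gate(torch.cat([context, lang_onehot], dim=-1)))
        return g * style                                   # gated style embedding

# Usage: per-step style embeddings for a mixed Mandarin/English batch.
layer = LanguageStyleTokens()
ctx = torch.randn(4, 256)                  # attended encoder contexts
lang = torch.tensor([0, 0, 1, 1])          # assumed: 0 = Mandarin, 1 = English
style_emb = layer(ctx, lang)               # (4, 256), conditions the decoder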