Improve Cross-Lingual Text-To-Speech Synthesis on Monolingual Corpora with Pitch Contour Information

Cross-lingual text-to-speech (TTS) synthesis on monolingual corpora remains a challenging task, especially when many languages are involved. In this paper, we improve a cross-lingual TTS model trained on monolingual corpora by incorporating pitch contour information. We propose a method to obtain pitch contour sequences for different languages without manual annotation, and extend a Tacotron-based TTS model with the proposed Pitch Contour Extraction (PCE) module. Our experimental results show that the proposed approach effectively improves the naturalness and consistency of synthesized mixed-lingual utterances.
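As a rough illustration of what annotation-free pitch contour sequences could look like, the sketch below estimates a frame-level F0 track via autocorrelation and quantizes the voiced frames into a small set of per-utterance contour tokens. This is a minimal sketch under assumed design choices (autocorrelation-based F0, uniform 5-level quantization); the function names and parameters are illustrative, not the paper's actual PCE module.

```python
import numpy as np

def frame_f0(frame, sr, fmin=80.0, fmax=400.0):
    """Estimate F0 of one frame by picking the autocorrelation peak
    inside the plausible pitch-lag range; returns 0.0 for unvoiced frames."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), min(int(sr / fmin), len(ac) - 1)
    lag = lo + int(np.argmax(ac[lo:hi]))
    # weak periodicity -> treat as unvoiced
    if ac[0] <= 0 or ac[lag] < 0.3 * ac[0]:
        return 0.0
    return sr / lag

def pitch_contour_tokens(wav, sr, frame_len=1024, hop=256, n_levels=5):
    """Quantize the voiced F0 track into coarse contour tokens,
    normalized per utterance (token 0 = unvoiced, 1..n_levels = low..high)."""
    f0 = np.array([frame_f0(wav[i:i + frame_len], sr)
                   for i in range(0, len(wav) - frame_len, hop)])
    voiced = f0 > 0
    tokens = np.zeros_like(f0, dtype=int)
    if not voiced.any():
        return tokens
    lo, hi = f0[voiced].min(), f0[voiced].max() + 1e-6
    levels = ((f0[voiced] - lo) / (hi - lo) * n_levels).astype(int)
    tokens[voiced] = 1 + np.minimum(levels, n_levels - 1)
    return tokens
```

Because the tokens are normalized within each utterance, they capture relative pitch movement rather than absolute F0, which is what makes such a representation usable across speakers and languages without manual prosodic labels.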
