Paper Information - Improve Cross-Lingual Text-To-Speech Synthesis on Monolingual Corpora with Pitch Contour Information

Cross-lingual text-to-speech (TTS) synthesis on monolingual corpora remains a challenging task, especially when many languages are involved. In this paper, we improve the cross-lingual TTS model on monolingual corpora with pitch contour information. We propose a method to obtain pitch contour sequences for different languages without manual annotation, and extend the Tacotron-based TTS model with the proposed Pitch Contour Extraction (PCE) module. Our experimental results show that the proposed approach effectively improves the naturalness and consistency of synthesized mixed-lingual utterances.
