AISHELL-3: A Multi-speaker Mandarin TTS Corpus and the Baselines

In this paper, we present AISHELL-3, a large-scale, high-fidelity multi-speaker Mandarin speech corpus for training multi-speaker text-to-speech (TTS) systems. The corpus contains roughly 85 hours of emotion-neutral recordings spoken by 218 native Mandarin Chinese speakers, whose auxiliary attributes such as gender, age group, and native accent are explicitly annotated and provided with the corpus. Character-level and pinyin-level transcripts are provided along with the recordings. We present a baseline system that uses AISHELL-3 for multi-speaker Mandarin speech synthesis. The system extends Tacotron-2 by incorporating a speaker verification model and a corresponding voice-similarity loss as a feedback constraint. We aim to use the presented corpus to build a robust synthesis model capable of zero-shot voice cloning; the system trained on this dataset also generalizes well to speakers never seen during training. Objective evaluation results from our experiments show that the proposed multi-speaker synthesis system achieves high voice similarity in terms of both speaker embedding similarity and equal error rate. The dataset, baseline system code, and generated samples are available online.
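The feedback-constraint idea described above can be summarized as follows: a pre-trained speaker verification network embeds both the reference recording and the synthesized spectrogram, and a similarity loss between the two embeddings is added to the usual Tacotron-2 training objective. Below is a minimal PyTorch-style sketch of such a loss, assuming a hypothetical `speaker_encoder` module that maps a mel spectrogram to a fixed-dimensional speaker embedding; the module name and the loss form are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def feedback_constraint_loss(speaker_encoder: torch.nn.Module,
                             ref_mel: torch.Tensor,
                             synth_mel: torch.Tensor) -> torch.Tensor:
    """Voice-similarity loss between reference and synthesized speech.

    The verification model stays frozen; gradients still flow through
    it into `synth_mel`, and hence into the synthesizer being trained.
    """
    for p in speaker_encoder.parameters():
        p.requires_grad_(False)                 # freeze the verifier
    with torch.no_grad():
        ref_emb = speaker_encoder(ref_mel)      # (batch, emb_dim), no grad
    synth_emb = speaker_encoder(synth_mel)      # grads reach the synthesizer
    # Encourage the synthesized utterance to carry the target voice:
    # maximize cosine similarity, i.e. minimize (1 - cos).
    cos = F.cosine_similarity(ref_emb, synth_emb, dim=-1)
    return (1.0 - cos).mean()
```

In training, this term would be added to the spectrogram reconstruction loss with a tunable weight (the weight is likewise an assumption here). The same frozen verifier can then score synthesized-versus-reference pairs at evaluation time, which is how embedding-similarity and equal-error-rate measurements of the kind reported above are typically obtained.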
