Multi-Singer: Fast Multi-Singer Singing Voice Vocoder With A Large-Scale Corpus

High-fidelity multi-singer singing voice synthesis is challenging for neural vocoders due to the shortage of singing voice data, limited singer generalization, and large computational cost. Existing open corpora cannot meet the requirements of high-fidelity singing voice synthesis because of weaknesses in scale and quality. Previous vocoders also struggle with multi-singer modeling: a distinct degradation emerges when generating singing voices of unseen singers. To accelerate singing voice research in the community, we release OpenSinger, a large-scale, multi-singer Chinese singing voice dataset. To tackle the difficulty of modeling unseen singers, we propose Multi-Singer, a fast multi-singer vocoder based on generative adversarial networks. Specifically, 1) Multi-Singer uses a multi-band generator to speed up both training and inference; 2) to capture and rebuild singer identity from the acoustic features (i.e., mel-spectrograms), Multi-Singer adopts a singer-conditional discriminator and a conditional adversarial training objective; 3) to supervise the reconstruction of singer identity in the frequency-domain spectral envelope, we propose an auxiliary singer perceptual loss. This joint training approach works effectively in GANs for multi-singer voice modeling. Experimental results verify the effectiveness of OpenSinger and show that Multi-Singer improves on previous methods in both speed and quality when modeling the singing voices of unseen singers. A further experiment shows that, combined with FastSpeech 2 as the acoustic model, Multi-Singer achieves strong robustness in the multi-singer singing voice synthesis pipeline.
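The abstract does not specify the multi-band design. A common realization of multi-band waveform generation (used, for example, in Multi-band MelGAN) is a pseudo-QMF analysis/synthesis bank: the generator predicts B subband signals at 1/B of the waveform rate, and a fixed near-perfect-reconstruction filterbank merges them back into the full-rate waveform. The PyTorch sketch below illustrates that idea under conventional filter settings (62 taps, Kaiser window); it is an assumption about the design, not the paper's actual implementation.

```python
import numpy as np
import torch
import torch.nn.functional as F

def design_prototype_filter(taps: int = 62, cutoff: float = 0.142, beta: float = 9.0):
    """Kaiser-windowed lowpass prototype for a pseudo-QMF bank (assumed design)."""
    n = np.arange(taps + 1) - taps / 2
    with np.errstate(invalid="ignore"):
        h = np.sin(np.pi * cutoff * n) / (np.pi * n)
    h[taps // 2] = cutoff  # limit of the sinc term at n = 0
    return h * np.kaiser(taps + 1, beta)

class PQMF(torch.nn.Module):
    """Near-perfect-reconstruction pseudo-QMF analysis/synthesis bank."""
    def __init__(self, subbands: int = 4, taps: int = 62):
        super().__init__()
        h = design_prototype_filter(taps)
        n = np.arange(taps + 1) - taps / 2
        ana = np.zeros((subbands, taps + 1))
        syn = np.zeros((subbands, taps + 1))
        for k in range(subbands):
            # cosine modulation shifts the lowpass prototype to band k
            phase = (2 * k + 1) * np.pi / (2 * subbands) * n
            ana[k] = 2 * h * np.cos(phase + (-1) ** k * np.pi / 4)
            syn[k] = 2 * h * np.cos(phase - (-1) ** k * np.pi / 4)
        self.register_buffer("ana", torch.from_numpy(ana).float().unsqueeze(1))
        self.register_buffer("syn", torch.from_numpy(syn).float().unsqueeze(0))
        # one-hot filters implementing stride-B down/upsampling per subband
        updown = torch.zeros(subbands, subbands, subbands)
        for k in range(subbands):
            updown[k, k, 0] = 1.0
        self.register_buffer("updown", updown)
        self.subbands = subbands
        self.pad = torch.nn.ConstantPad1d(taps // 2, 0.0)

    def analysis(self, x):
        # x: (N, 1, T) -> (N, subbands, T // subbands)
        x = F.conv1d(self.pad(x), self.ana)
        return F.conv1d(x, self.updown, stride=self.subbands)

    def synthesis(self, x):
        # x: (N, subbands, T // subbands) -> (N, 1, T)
        x = F.conv_transpose1d(x, self.updown * self.subbands, stride=self.subbands)
        return F.conv1d(self.pad(x), self.syn)
```

Under this scheme the analysis side provides subband training targets, so the generator only ever produces low-rate streams; that is where the training and inference speedup comes from, and at inference time only `synthesis` is needed.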
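Likewise, the abstract gives only the idea behind the singer-conditional discriminator and the singer perceptual loss. Below is a minimal sketch, assuming a pretrained singer-verification encoder (e.g., a GE2E-style model) supplies a fixed-dimensional singer embedding, and assuming an LSGAN-style adversarial objective; all module names, layer sizes, and the L1 form of the perceptual loss are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingerConditionalDiscriminator(nn.Module):
    """Waveform discriminator conditioned on a singer embedding.

    Assumed design: the embedding is broadcast along time and concatenated
    with the waveform channel before a small convolutional stack.
    """
    def __init__(self, embed_dim: int = 256, channels: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1 + embed_dim, channels, 15, padding=7),
            nn.LeakyReLU(0.2),
            nn.Conv1d(channels, channels, 41, stride=4, padding=20),
            nn.LeakyReLU(0.2),
            nn.Conv1d(channels, 1, 3, padding=1),  # per-frame real/fake score
        )

    def forward(self, wav, emb):
        # wav: (N, 1, T); emb: (N, embed_dim)
        cond = emb.unsqueeze(-1).expand(-1, -1, wav.size(-1))
        return self.net(torch.cat([wav, cond], dim=1))

def singer_perceptual_loss(encoder, real, fake):
    """Match singer embeddings of generated and real audio (assumed L1 form)."""
    with torch.no_grad():  # the pretrained singer encoder is kept frozen
        target = encoder(real)
    return F.l1_loss(encoder(fake), target)

def lsgan_d_loss(disc, real, fake, emb):
    """Conditional LSGAN objective for the discriminator (assumed GAN loss)."""
    return ((disc(real, emb) - 1.0) ** 2).mean() + (disc(fake.detach(), emb) ** 2).mean()
```

Conditioning the discriminator on the singer embedding pushes the generator toward waveforms whose identity matches the conditioning, while the perceptual loss supervises identity directly in embedding space; the two signals are complementary in joint training.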
