DiffSinger: Diffusion Acoustic Model for Singing Voice Synthesis

A singing voice synthesis (SVS) system is built to synthesize high-quality and expressive singing voices, in which the acoustic model generates acoustic features (e.g., mel-spectrograms) from a music score. Previous singing acoustic models adopt a simple loss (e.g., L1 or L2) or a generative adversarial network (GAN) to reconstruct the acoustic features, but they suffer from over-smoothing and unstable training, respectively, which hinders the naturalness of the synthesized singing. In this work, we propose DiffSinger, an acoustic model for SVS based on the diffusion probabilistic model. DiffSinger is a parameterized Markov chain that iteratively converts noise into a mel-spectrogram conditioned on the music score. By implicitly optimizing a variational bound, DiffSinger can be trained stably and generates realistic outputs. To further improve voice quality, we introduce a shallow diffusion mechanism that makes better use of the prior knowledge learned with the simple loss. Specifically, DiffSinger starts generation at a shallow step smaller than the total number of diffusion steps, exploiting the intersection of the diffusion trajectories of the ground-truth mel-spectrogram and the one predicted by a simple mel-spectrogram decoder. In addition, we train a boundary-prediction network to locate this intersection and determine the shallow step adaptively. Evaluations conducted on a Chinese singing dataset demonstrate that DiffSinger outperforms state-of-the-art SVS work by a notable margin (a 0.11 MOS gain). Our extension experiments also demonstrate that DiffSinger generalizes to the text-to-speech task.
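To make the shallow diffusion mechanism concrete, the sketch below shows what such inference might look like in Python/NumPy, assuming a standard DDPM-style noise schedule and a trained noise-prediction network. The names (`shallow_diffusion_inference`, `denoise_fn`, `mel_aux`) are illustrative assumptions rather than the paper's actual API, and the music-score conditioning argument is omitted for brevity.

```python
import numpy as np

def shallow_diffusion_inference(mel_aux, denoise_fn, betas, k, rng=None):
    """Hypothetical sketch of shallow-diffusion sampling.

    mel_aux    : mel-spectrogram predicted by the simple (L1-trained) decoder
    denoise_fn : DDPM-style network, denoise_fn(x_t, t) -> predicted noise
    betas      : noise schedule beta_1..beta_T (length T)
    k          : shallow step at which the trajectories intersect (k <= T)
    """
    rng = rng or np.random.default_rng()
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)

    # Forward-diffuse the auxiliary prediction to step k in closed form:
    # x_k = sqrt(alpha_bar_k) * mel_aux + sqrt(1 - alpha_bar_k) * eps
    eps = rng.standard_normal(mel_aux.shape)
    x = np.sqrt(alpha_bar[k - 1]) * mel_aux \
        + np.sqrt(1.0 - alpha_bar[k - 1]) * eps

    # Run only the last k reverse steps (standard DDPM ancestral sampling),
    # instead of all T steps starting from pure Gaussian noise.
    for t in range(k, 0, -1):
        eps_hat = denoise_fn(x, t)
        coef = (1.0 - alphas[t - 1]) / np.sqrt(1.0 - alpha_bar[t - 1])
        mean = (x - coef * eps_hat) / np.sqrt(alphas[t - 1])
        if t > 1:
            sigma = np.sqrt(betas[t - 1])  # one common choice of variance
            x = mean + sigma * rng.standard_normal(x.shape)
        else:
            x = mean  # no noise added at the final step
    return x
```

In words: because the simple decoder's prediction already lies close to the ground-truth mel-spectrogram's diffusion trajectory at step k, only k reverse steps are needed rather than the full T (e.g., with a linear schedule such as `betas = np.linspace(1e-4, 0.06, 100)`, chosen here purely for illustration), which both speeds up inference and reuses the prior learned by the simple loss.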
