Adversarially Trained End-to-End Korean Singing Voice Synthesis System

In this paper, we propose an end-to-end Korean singing voice synthesis system that generates singing voice from lyrics and a symbolic melody using three novel approaches: 1) phonetic enhancement masking, 2) local conditioning of text and pitch in the super-resolution network, and 3) conditional adversarial training. The proposed system consists of two main modules: a mel-synthesis network that generates a mel-spectrogram from the given input information, and a super-resolution network that upsamples the generated mel-spectrogram into a linear spectrogram. In the mel-synthesis network, phonetic enhancement masking generates implicit formant masks solely from the input text, enabling more accurate phonetic control of the singing voice. In addition, we show that the two other proposed methods, local conditioning of text and pitch and conditional adversarial training, are crucial for realistic generation of the human singing voice in the super-resolution process. Finally, both quantitative and qualitative evaluations are conducted, confirming the validity of all proposed methods.
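The core of phonetic enhancement masking is that a mask in the mel-frequency domain is predicted from the text encoding alone and applied multiplicatively to the decoder output, so that the text stream controls the formant (timbre) structure. The following is only a rough NumPy sketch of that idea, not the authors' implementation: the array sizes, the linear mask head `W_mask`, and the random stand-ins for learned network outputs are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes, not taken from the paper.
N_MELS, T, TXT_DIM = 80, 100, 64

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Random stand-ins for what would be learned network outputs:
decoder_out   = rng.standard_normal((N_MELS, T))         # raw mel prediction
text_features = rng.standard_normal((TXT_DIM, T))        # text-only encoding
W_mask        = 0.1 * rng.standard_normal((N_MELS, TXT_DIM))  # hypothetical mask head

# Implicit formant mask predicted solely from text, squashed into (0, 1),
# then applied elementwise so the text stream shapes the spectral envelope.
formant_mask = sigmoid(W_mask @ text_features)           # shape (N_MELS, T)
mel_spectrogram = decoder_out * formant_mask             # masked mel output

print(mel_spectrogram.shape)  # (80, 100)
```

The multiplicative form is the key design point: since the mask depends only on text, changing the lyrics changes the spectral envelope directly, independently of the pitch conditioning.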