Adversarially Trained End-to-End Korean Singing Voice Synthesis System

In this paper, we propose an end-to-end Korean singing voice synthesis system that generates singing voice from lyrics and a symbolic melody using three novel approaches: 1) phonetic enhancement masking, 2) local conditioning of text and pitch in the super-resolution network, and 3) conditional adversarial training. The proposed system consists of two main modules: a mel-synthesis network that generates a mel-spectrogram from the given input information, and a super-resolution network that upsamples the generated mel-spectrogram into a linear spectrogram. In the mel-synthesis network, phonetic enhancement masking generates implicit formant masks solely from the input text, enabling more accurate phonetic control of the singing voice. In addition, we show that the two other proposed methods, local conditioning of text and pitch and conditional adversarial training, are crucial for realistic generation of the human singing voice in the super-resolution process. Finally, both quantitative and qualitative evaluations are conducted, confirming the validity of all proposed methods.
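The core of phonetic enhancement masking is that a mask in the mel-frequency domain is predicted from the text encoding alone and applied multiplicatively to the decoder output, so that the text stream controls the formant (timbre) structure. The following is only a rough NumPy sketch of that idea, not the authors' implementation: the array sizes, the linear mask head `W_mask`, and the random stand-ins for learned network outputs are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes, not taken from the paper.
N_MELS, T, TXT_DIM = 80, 100, 64

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Random stand-ins for what would be learned network outputs:
decoder_out   = rng.standard_normal((N_MELS, T))         # raw mel prediction
text_features = rng.standard_normal((TXT_DIM, T))        # text-only encoding
W_mask        = 0.1 * rng.standard_normal((N_MELS, TXT_DIM))  # hypothetical mask head

# Implicit formant mask predicted solely from text, squashed into (0, 1),
# then applied elementwise so the text stream shapes the spectral envelope.
formant_mask = sigmoid(W_mask @ text_features)           # shape (N_MELS, T)
mel_spectrogram = decoder_out * formant_mask             # masked mel output

print(mel_spectrogram.shape)  # (80, 100)
```

The multiplicative form is the key design point: since the mask depends only on text, changing the lyrics changes the spectral envelope directly, independently of the pitch conditioning.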