Our baseline SVS system differs from existing singing synthesis studies in three respects. First, our network uses lyric information and pitch information independently when generating a mel-spectrogram. We assumed that building the principles of the vocal organs described by the source-filter model [1] into the network structure would allow the training data to be used more effectively, so we designed a phonetic enhancement mask decoder that decodes only lyric information and a mel decoder that decodes pitch information. As a result, our network was able to learn the two kinds of information independently, as shown in Figure 1-(b). Second, our network conditions the super-resolution (SR) stage on the encoded text and pitch information. We assumed that the mel-spectrogram generated at the intermediate stage may be incomplete, which could degrade the performance of the SR process. To solve this problem, we reuse the pitch and lyric encodings, which may not have been sufficiently conditioned, in the SR process as
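
The separation of lyric and pitch decoding described in the first point can be illustrated with a minimal sketch. The text states only that one decoder handles lyric information and the other handles pitch information; how their outputs are combined is not specified here. The sketch below assumes, by analogy with the source-filter model, that the lyric branch yields a mask in (0, 1) that is multiplied element-wise with a linear-magnitude output of the pitch branch. The function names and toy values are hypothetical and stand in for the outputs of the two decoders.

```python
import math


def sigmoid(x):
    """Squash a raw decoder output into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def combine_source_filter(mask_logits, pitch_mel):
    """Combine two decoder outputs of shape [frames][mel_bins].

    ASSUMPTION (not stated in the text): the lyric-driven mask acts as a
    filter on the pitch-driven spectrogram through an element-wise product.
    """
    if len(mask_logits) != len(pitch_mel):
        raise ValueError("frame counts differ")
    combined = []
    for mask_row, mel_row in zip(mask_logits, pitch_mel):
        if len(mask_row) != len(mel_row):
            raise ValueError("mel bin counts differ")
        combined.append([sigmoid(m) * v for m, v in zip(mask_row, mel_row)])
    return combined


# Toy stand-ins for the two decoder outputs: 2 frames x 3 mel bins.
mask_logits = [[0.0, 2.0, -2.0], [1.0, 0.0, -1.0]]  # hypothetical lyric branch
pitch_mel = [[1.0, 1.0, 1.0], [0.5, 2.0, 4.0]]      # hypothetical pitch branch
mel = combine_source_filter(mask_logits, pitch_mel)
```

Under this assumption the mask can only attenuate the pitch-driven spectrogram, so formant-like shaping comes from the lyric branch while harmonic structure comes from the pitch branch.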
[1] Kyogu Lee, et al. Adversarially Trained End-to-end Korean Singing Voice Synthesis System. INTERSPEECH, 2019.
[2] Jae Lim, et al. Signal estimation from modified short-time Fourier transform. 1984.
[3] Takeru Miyato, et al. cGANs with Projection Discriminator. ICLR, 2018.
[4] J. Liljencrants, et al. A Four-parameter Model of Glottal Flow. Dept. for Speech, Music and Hearing Quarterly Progress and Status Report, 2022.
[5] Yoshua Bengio, et al. Generative Adversarial Nets. NIPS, 2014.