MULTI-SINGER SINGING VOICE SYNTHESIS SYSTEM

Our baseline SVS system differs from existing singing synthesis studies in three respects. First, our network uses lyric information and pitch information independently when generating a mel-spectrogram. We assumed that building the principles of the vocal organs described by the source-filter model [1] into the network structure would let the network use the training data more effectively, so we designed a phonetic enhancement mask decoder that decodes only lyric information and a mel decoder that decodes pitch information. As a result, our network was able to learn the two types of information independently, as shown in Figure 1-(b). Second, our network conditions the super-resolution (SR) stage on the encoded text and pitch information. We assumed that the mel-spectrogram generated by our network at the intermediate stage may be incomplete, which might degrade the performance of the SR process. To solve this problem, we used a method that recycles the pitch and lyric encoding information, which may not have been sufficiently conditioned, into the SR process as