HiFiDenoise: High-Fidelity Denoising Text to Speech with Adversarial Networks

Building a high-fidelity speech synthesis system from noisy speech data is a challenging but valuable task that could significantly reduce the cost of data collection. Existing methods usually either train speech synthesis systems on speech denoised by an enhancement model or feed noise information into the system as a condition. These methods suppress noise to some extent, but the quality and prosody of their synthesized speech remain far from those of natural speech. In this paper, we propose HiFiDenoise, a speech synthesis system with adversarial networks that can synthesize high-fidelity speech from low-quality, noisy speech data. Specifically, 1) to tackle the difficulty of noise modeling, we introduce multi-length adversarial training in the noise condition module. 2) To handle inaccurate pitch extraction caused by noise, we remove the pitch predictor from the acoustic model and add discriminators to the mel-spectrogram generator. 3) We also apply HiFiDenoise to singing voice synthesis with a noisy singing dataset. Experiments show that our model outperforms the baseline by 0.36 and 0.44 MOS on speech and singing, respectively.
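The abstract's multi-length adversarial training can be illustrated with a minimal sketch: discriminators score randomly cropped mel-spectrogram windows of several lengths, and their losses are summed. This is an illustrative assumption, not the paper's implementation — the window lengths, the least-squares GAN loss, and the stand-in discriminator are all hypothetical choices made for the example.

```python
import numpy as np

def random_windows(mel, lengths, rng):
    """Crop one random window per length from a mel-spectrogram (freq x time)."""
    wins = []
    for length in lengths:
        start = rng.integers(0, mel.shape[1] - length + 1)
        wins.append(mel[:, start:start + length])
    return wins

def lsgan_d_loss(d_real, d_fake):
    """Least-squares GAN discriminator loss (one common choice; assumed here)."""
    return np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2)

rng = np.random.default_rng(0)
real = rng.standard_normal((80, 400))   # ground-truth mel (80 bins, 400 frames)
fake = rng.standard_normal((80, 400))   # generated mel
toy_disc = lambda w: w.mean(axis=0)     # stand-in for a convolutional discriminator

# Sum the discriminator loss over windows of several lengths.
lengths = (32, 64, 128)
loss = sum(lsgan_d_loss(toy_disc(r), toy_disc(f))
           for r, f in zip(random_windows(real, lengths, rng),
                           random_windows(fake, lengths, rng)))
```

In a real system each `toy_disc` would be a separate convolutional network, and a matching generator loss would push `toy_disc(fake_window)` toward 1.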