WG-WaveNet: Real-Time High-Fidelity Speech Synthesis without GPU

In this paper, we propose WG-WaveNet, a fast, lightweight, and high-quality waveform generation model. WG-WaveNet is composed of a compact flow-based model and a post-filter. The two components are trained jointly by maximizing the likelihood of the training data and optimizing loss functions in the frequency domain. Because the flow-based model is designed to be heavily compressed, the proposed model requires far less computation than other waveform generation models during both training and inference; even though the model is highly compressed, the post-filter maintains the quality of the generated waveform. Our PyTorch implementation can be trained with less than 8 GB of GPU memory and generates audio samples at a rate of more than 960 kHz on an NVIDIA 1080 Ti GPU. Furthermore, we show that even when synthesizing on a CPU, the proposed method generates 44.1 kHz speech waveforms 1.2 times faster than real time. Experiments also show that the quality of the generated audio is comparable to that of other methods. Audio samples are publicly available online.
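To make the joint training objective concrete, below is a minimal PyTorch sketch of such a loss: a flow negative log-likelihood combined with multi-resolution STFT losses on the post-filtered waveform. The module interfaces (`flow`, `post_filter`), the STFT resolutions, and the weighting `lambda_freq` are illustrative assumptions, not the paper's exact configuration.

```python
# Hedged sketch: flow NLL + frequency-domain (multi-resolution STFT)
# losses on the post-filtered output. FlowModel/PostFilter internals
# are placeholders, not the paper's architecture.
import torch
import torch.nn.functional as F


def stft_magnitude(x, n_fft, hop, win):
    """Magnitude spectrogram of a batch of waveforms (B, T)."""
    window = torch.hann_window(win, device=x.device)
    spec = torch.stft(x, n_fft=n_fft, hop_length=hop, win_length=win,
                      window=window, return_complex=True)
    return spec.abs().clamp(min=1e-7)


def multi_resolution_stft_loss(y_hat, y,
                               resolutions=((512, 128, 512),
                                            (1024, 256, 1024),
                                            (2048, 512, 2048))):
    """Spectral-convergence plus log-magnitude L1 losses, summed over
    several STFT resolutions (hypothetical settings)."""
    loss = 0.0
    for n_fft, hop, win in resolutions:
        s_hat = stft_magnitude(y_hat, n_fft, hop, win)
        s = stft_magnitude(y, n_fft, hop, win)
        sc = torch.norm(s - s_hat, p="fro") / torch.norm(s, p="fro")
        mag = F.l1_loss(torch.log(s_hat), torch.log(s))
        loss = loss + sc + mag
    return loss


def joint_loss(flow, post_filter, y, mel, lambda_freq=1.0):
    """NLL of the flow plus frequency-domain losses on the
    post-filtered waveform from the same mel conditioning."""
    # Likelihood term: the flow maps the waveform y to a latent z and
    # returns the log-determinant of the Jacobian (assumed interface).
    z, log_det = flow(y, mel)
    nll = 0.5 * (z ** 2).sum(dim=-1).mean() - log_det.mean()

    # Frequency-domain term: sample a latent, invert the flow to get a
    # coarse waveform, refine it with the post-filter, and compare
    # spectra against the ground truth.
    z_sample = torch.randn_like(z)
    y_coarse = flow.inverse(z_sample, mel)  # assumed inverse pass
    y_hat = post_filter(y_coarse)
    return nll + lambda_freq * multi_resolution_stft_loss(y_hat, y)
```

One design point this sketch illustrates: the likelihood term constrains only the flow, while the STFT losses flow gradients through both the post-filter and the flow's inverse pass, so the two components are optimized jointly rather than in separate stages.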
