StyleMelGAN: An Efficient High-Fidelity Adversarial Vocoder with Temporal Adaptive Normalization

In recent years, neural vocoders have surpassed classical speech generation approaches in naturalness and perceptual quality of the synthesized speech. Computationally heavy models like WaveNet and WaveGlow achieve the best results, while lightweight GAN models, e.g., MelGAN and Parallel WaveGAN, remain inferior in terms of perceptual quality. We therefore propose StyleMelGAN, a lightweight neural vocoder that allows synthesis of high-fidelity speech with low computational complexity. StyleMelGAN employs temporal adaptive normalization to style a low-dimensional noise vector with the acoustic features of the target speech. For efficient training, multiple random-window discriminators adversarially evaluate the speech signal analyzed by a filter bank, with regularization provided by a multi-scale spectral reconstruction loss. The highly parallelizable speech generation is several times faster than real-time on CPUs and GPUs. MUSHRA and P.800 listening tests show that StyleMelGAN outperforms prior neural vocoders in copy-synthesis and Text-to-Speech scenarios.
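The temporal adaptive normalization at the core of the generator can be pictured as a one-dimensional analogue of spatially-adaptive normalization: activations derived from the noise vector are normalized and then modulated, time step by time step, by scale and shift signals computed from the mel-spectrogram conditioning. Below is a minimal sketch in PyTorch; the module name TADE, the channel counts, kernel sizes, and the nearest-neighbour upsampling are illustrative assumptions rather than the paper's exact configuration.

# Minimal sketch of temporal adaptive normalization in PyTorch.
# Hyper-parameters (channel counts, kernel sizes, upsampling mode) are
# illustrative assumptions, not the exact StyleMelGAN configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TADE(nn.Module):
    """Normalize a hidden activation, then modulate it with scale/shift
    signals derived from the mel-spectrogram conditioning."""

    def __init__(self, channels: int, mel_dim: int = 80, kernel_size: int = 3):
        super().__init__()
        # Parameter-free channel-wise normalization of the content signal.
        self.norm = nn.InstanceNorm1d(channels, affine=False)
        pad = kernel_size // 2
        # Convolutions that map the conditioning features to per-time-step
        # modulation parameters gamma (scale) and beta (shift).
        self.cond_conv = nn.Conv1d(mel_dim, channels, kernel_size, padding=pad)
        self.gamma_conv = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.beta_conv = nn.Conv1d(channels, channels, kernel_size, padding=pad)

    def forward(self, x: torch.Tensor, mel: torch.Tensor) -> torch.Tensor:
        # x:   (batch, channels, T)    hidden activations grown from the noise vector
        # mel: (batch, mel_dim, T_mel) acoustic features of the target speech
        # Upsample the conditioning to the temporal resolution of x.
        cond = F.interpolate(mel, size=x.shape[-1], mode="nearest")
        cond = F.leaky_relu(self.cond_conv(cond), 0.2)
        gamma = self.gamma_conv(cond)
        beta = self.beta_conv(cond)
        # Style the normalized activations with the conditioning.
        return gamma * self.norm(x) + beta

# Usage: style a noise-derived activation with mel features.
if __name__ == "__main__":
    tade = TADE(channels=64, mel_dim=80)
    hidden = torch.randn(1, 64, 2400)   # activations derived from the noise vector
    mel = torch.randn(1, 80, 30)        # mel frames for the same utterance
    out = tade(hidden, mel)
    print(out.shape)                    # torch.Size([1, 64, 2400])

In a full vocoder, one would typically stack several such blocks while progressively upsampling the noise-derived signal to the audio sample rate; during training, the random-window discriminators and multi-scale spectral reconstruction loss described in the abstract would then operate on the resulting waveform.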
