WOLONet: Wave Outlooker for Efficient and High Fidelity Speech Synthesis

Recently, GAN-based neural vocoders such as Parallel WaveGAN[1], MelGAN[2], HiFiGAN[3], and UnivNet[4] have become popular due to their lightweight and parallel structure, resulting in a real-time synthesized waveform with high fidelity, even on a CPU. HiFiGAN[3] and UnivNet[4] are two SOTA vocoders. Despite their high quality, there is still room for improvement. In this paper, motivated by the structure of Vision Outlooker from computer vision, we adopt a similar idea and propose an effective and lightweight neural vocoder called WOLONet. In this network, we develop a novel lightweight block that uses a location-variable, channel-independent, and depthwise dynamic convolutional kernel with sinusoidally activated dynamic kernel weights. To demonstrate the effective-ness and generalizability of our method, we perform an ablation study to verify our novel design and make a subjective and objective comparison with typical GAN-based vocoders. The results show that our WOLONet achieves the best generation quality while requiring fewer parameters than the two neural SOTA vocoders, i.e., HiFiGAN and UnivNet.

[1]  Andreas Zell,et al.  Seeing Implicit Neural Representations as Fourier Series , 2021, 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV).

[2]  Shuicheng Yan,et al.  VOLO: Vision Outlooker for Visual Recognition , 2021, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[3]  Jaesam Yoon,et al.  UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation , 2021, Interspeech.

[4]  Jing Xiao,et al.  LVCNet: Efficient Condition-Dependent Modeling Network for Waveform Generation , 2021, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[5]  Jing Xiao,et al.  MelGlow: Efficient Waveform Generative Network Based on Location-Variable Convolution , 2020, ArXiv.

[6]  Guillaume Fuchs,et al.  StyleMelGAN: An Efficient High-Fidelity Adversarial Vocoder with Temporal Adaptive Normalization , 2020, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[7]  Ryuichi Yamamoto,et al.  Parallel Waveform Synthesis Based on Generative Adversarial Networks with Voicing-Aware Conditional Discriminators , 2020, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[8]  Heiga Zen,et al.  WaveGrad: Estimating Gradients for Waveform Generation , 2020, ICLR.

[9]  Kurt Keutzer,et al.  FBWave: Efficient and Scalable Neural Vocoders for Streaming Text-To-Speech on the Edge , 2020, ArXiv.

[10]  D. Lim,et al.  Universal MelGAN: A Robust Neural Vocoder for High-Fidelity Waveform Generation in Multiple Domains , 2020, ArXiv.

[11]  Jaehyeon Kim,et al.  HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis , 2020, NeurIPS.

[12]  Youngik Kim,et al.  VocGAN: A High-Fidelity Real-time Vocoder with a Hierarchically-nested Adversarial Network , 2020, INTERSPEECH.

[13]  Gordon Wetzstein,et al.  Implicit Neural Representations with Periodic Activation Functions , 2020, NeurIPS.

[14]  FeatherWave: An efficient high-fidelity neural vocoder with multi-band linear prediction , 2020, INTERSPEECH.

[15]  Wei Ping,et al.  WaveFlow: A Compact Flow-based Model for Raw Audio , 2019, ICML.

[16]  Ryuichi Yamamoto,et al.  Parallel Wavegan: A Fast Waveform Generation Model Based on Generative Adversarial Networks with Multi-Resolution Spectrogram , 2019, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[17]  Yoshua Bengio,et al.  MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis , 2019, NeurIPS.

[18]  Chengzhu Yu,et al.  DurIAN: Duration Informed Attention Network For Multimodal Synthesis , 2019, ArXiv.

[19]  Sungwon Kim,et al.  FloWaveNet : A Generative Flow for Raw Audio , 2018, ICML.

[20]  Ryan Prenger,et al.  Waveglow: A Flow-based Generative Network for Speech Synthesis , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[21]  Jan Skoglund,et al.  LPCNET: Improving Neural Speech Synthesis through Linear Prediction , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[22]  Wei Ping,et al.  ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech , 2018, ICLR.

[23]  Yuxuan Wang,et al.  Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron , 2018, ICML.

[24]  Erich Elsen,et al.  Efficient Neural Audio Synthesis , 2018, ICML.

[25]  Mark Sandler,et al.  MobileNetV2: Inverted Residuals and Linear Bottlenecks , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[26]  Heiga Zen,et al.  Parallel WaveNet: Fast High-Fidelity Speech Synthesis , 2017, ICML.

[27]  Bo Chen,et al.  MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications , 2017, ArXiv.

[28]  Yoshua Bengio,et al.  SampleRNN: An Unconditional End-to-End Neural Audio Generation Model , 2016, ICLR.

[29]  Heiga Zen,et al.  WaveNet: A Generative Model for Raw Audio , 2016, SSW.

[30]  Masanori Morise,et al.  WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications , 2016, IEICE Trans. Inf. Syst..

[31]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[32]  Hideki Kawahara,et al.  STRAIGHT, exploitation of the other aspect of VOCODER: Perceptually isomorphic decomposition of speech sounds , 2006 .

[33]  Andries P. Hekstra,et al.  Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs , 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).