Quasi-Periodic WaveNet: An Autoregressive Raw Waveform Generative Model With Pitch-Dependent Dilated Convolution Neural Network

In this paper, a pitch-adaptive waveform generative model named Quasi-Periodic WaveNet (QPNet) is proposed to improve the limited pitch controllability of vanilla WaveNet (WN) using pitch-dependent dilated convolution neural networks (PDCNNs). Specifically, as a probabilistic autoregressive generative model with stacked dilated convolution layers, WN achieves high-fidelity audio waveform generation. However, its purely data-driven nature and lack of prior knowledge of audio signals degrade the pitch controllability of WN. For instance, it is difficult for WN to precisely generate the periodic components of audio signals when the given auxiliary fundamental frequency (<inline-formula><tex-math notation="LaTeX">$F_{0}$</tex-math></inline-formula>) features lie outside the <inline-formula><tex-math notation="LaTeX">$F_{0}$</tex-math></inline-formula> range observed in the training data. To address this problem, QPNet is proposed with two novel designs. First, the PDCNN component dynamically changes the network architecture of WN according to the given auxiliary <inline-formula><tex-math notation="LaTeX">$F_{0}$</tex-math></inline-formula> features. Second, a cascaded network structure simultaneously models the long- and short-term dependencies of quasi-periodic signals such as speech. Performance is evaluated on single-tone sinusoid and speech generation tasks. The experimental results show the effectiveness of the PDCNNs for unseen auxiliary <inline-formula><tex-math notation="LaTeX">$F_{0}$</tex-math></inline-formula> features and the effectiveness of the cascaded structure for speech generation.
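The core idea behind the PDCNN is that the dilation of each convolution layer is rescaled per timestep so that the receptive field tracks the pitch period implied by the auxiliary F0 contour. The following is a minimal NumPy sketch of that idea, not the paper's implementation: the function names and the `dense_factor` parameter are illustrative, and the F0 contour is assumed to be upsampled to the waveform sample rate.

```python
import numpy as np

def pitch_dependent_dilations(f0, fs=16000, dense_factor=4, base_dilation=1):
    """Compute a per-timestep dilation size from the F0 contour.

    Each voiced sample gets a dilation factor round(fs / (f0_t * dense_factor)),
    so higher pitches (shorter periods) yield smaller dilations and the
    receptive field follows the pitch period. Unvoiced samples (f0 <= 0)
    fall back to a dilation factor of 1.
    """
    f0 = np.asarray(f0, dtype=float)
    e = np.ones(f0.shape, dtype=int)
    voiced = f0 > 0
    e[voiced] = np.maximum(1, np.rint(fs / (f0[voiced] * dense_factor)).astype(int))
    return e * base_dilation

def pdcnn_layer(x, f0, w_past, w_now, fs=16000, dense_factor=4):
    """One causal pitch-dependent dilated convolution tap (kernel size 2).

    For each sample t, the past tap looks back a pitch-dependent number of
    samples d_t instead of a fixed dilation:
        y_t = w_past * x_{t - d_t} + w_now * x_t
    Indices before the signal start are clamped to 0 for simplicity.
    """
    x = np.asarray(x, dtype=float)
    d = pitch_dependent_dilations(f0, fs=fs, dense_factor=dense_factor)
    idx = np.maximum(np.arange(len(x)) - d, 0)
    return w_past * x[idx] + w_now * x
```

For example, at fs = 16 kHz with F0 = 200 Hz and a dense factor of 4, the dilation factor is round(16000 / 800) = 20 samples; a vanilla WN layer would instead use the same fixed dilation at every timestep regardless of pitch.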
