Using Cyclic Noise as the Source Signal for Neural Source-Filter-Based Speech Waveform Model

Neural source-filter (NSF) waveform models generate speech waveforms by morphing sine-based source signals through dilated convolutions in the time domain. Although the sine-based source signals help the NSF models produce voiced sounds with the specified pitch, the sine shape may constrain the generated waveform when the target voiced sounds are less periodic. In this paper, we propose a more flexible source signal called cyclic noise: a quasi-periodic noise sequence obtained by convolving a pulse train with static random noise, with a trainable decay rate that controls the signal shape. We further propose a masked spectral loss that guides the NSF models to produce periodic voiced sounds from the cyclic-noise source signal. Results from a large-scale listening test demonstrate the effectiveness of the cyclic noise and the masked spectral loss for speaker-independent NSF models in copy-synthesis experiments on the CMU ARCTIC database.
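As a rough illustration of the cyclic-noise construction described above, the sketch below builds the source signal by convolving a pitch-synchronous pulse train with an exponentially decaying noise kernel. This is a minimal sketch under our own assumptions: the decay constant `beta`, the kernel length, the unvoiced fallback, and the function name `cyclic_noise` are illustrative, and the paper learns the decay rate jointly with the model, which a fixed NumPy parameter cannot reproduce.

```python
import numpy as np

def cyclic_noise(f0, sr=16000, beta=1.0e-3, kernel_len=512, seed=0):
    """Illustrative (not the paper's exact) cyclic-noise source signal.

    f0   : per-sample fundamental frequency in Hz (<= 0 marks unvoiced).
    beta : decay time constant in seconds; trainable in the paper,
           fixed here for the sketch.
    """
    rng = np.random.default_rng(seed)
    n = len(f0)

    # Pulse train: one unit impulse whenever the accumulated
    # instantaneous phase of f0 completes a full cycle.
    phase = np.cumsum(f0 / sr)
    pulses = np.zeros(n)
    pulses[np.flatnonzero(np.diff(np.floor(phase)) > 0) + 1] = 1.0

    # Static random noise shaped by an exponential decay envelope;
    # beta controls how quickly the noise dies out after each pulse.
    t = np.arange(kernel_len) / sr
    kernel = rng.standard_normal(kernel_len) * np.exp(-t / beta)

    # Cyclic noise = pulse train convolved with the decaying noise.
    source = np.convolve(pulses, kernel)[:n]

    # Plain Gaussian noise in unvoiced regions (a common convention;
    # the abstract does not specify the unvoiced branch).
    unvoiced = f0 <= 0
    source[unvoiced] = 0.1 * rng.standard_normal(unvoiced.sum())
    return source

# Example: 0.5 s of cyclic noise at a constant 100 Hz pitch.
f0 = np.full(8000, 100.0)
src = cyclic_noise(f0)
```

The masked spectral loss could likewise be sketched as a log-spectral-amplitude distance evaluated only on voiced frames, so the periodicity constraint applies only where the target is periodic. The masking convention, the STFT settings, and the name `masked_spectral_loss` are again assumptions rather than the paper's exact definition.

```python
import numpy as np

def masked_spectral_loss(gen, ref, voiced_mask, n_fft=512, hop=80):
    """Sketch: squared log-magnitude STFT distance on voiced frames only.

    voiced_mask : boolean array with one entry per STFT frame.
    """
    win = np.hanning(n_fft)

    def log_mag(x):
        # Frame the signal, window it, and take log-magnitude spectra.
        frames = np.lib.stride_tricks.sliding_window_view(x, n_fft)[::hop]
        return np.log(np.abs(np.fft.rfft(frames * win, axis=-1)) + 1e-7)

    diff = log_mag(gen) - log_mag(ref)
    mask = voiced_mask[: diff.shape[0], None]  # broadcast over freq bins
    return np.sum(mask * diff ** 2) / max(mask.sum() * diff.shape[1], 1)
```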
