SampleRNN-Based Neural Vocoder for Statistical Parametric Speech Synthesis

This paper presents a SampleRNN-based neural vocoder for statistical parametric speech synthesis. The method uses a conditional SampleRNN model, composed of a hierarchy of GRU layers and feed-forward layers, to capture long-span dependencies between acoustic features and waveform sequences. Compared with conventional vocoders based on the source-filter model, the proposed vocoder is trained without assumptions derived from prior knowledge of speech production and can better model and recover phase information. Objective and subjective evaluations were conducted on two corpora. Experimental results show that the proposed vocoder achieves higher synthetic speech quality than the STRAIGHT vocoder and a WaveNet-based neural vocoder with similar run-time efficiency, regardless of whether natural or predicted acoustic features are used as inputs.
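To make the architecture concrete, the sketch below shows a heavily simplified two-tier conditional SampleRNN in numpy: a frame-level GRU consumes the previous waveform frame together with acoustic features and produces a conditioning state, and a sample-level feed-forward layer predicts a categorical distribution over quantized amplitudes one sample at a time. This is an illustrative assumption of the general scheme only; the dimensions, the two-tier simplification, the random (untrained) weights, and all names here are hypothetical, and the paper's actual model uses a deeper hierarchy and trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

Q = 256      # number of quantized amplitude levels (assumption)
FRAME = 16   # samples per frame-level step (assumption)
H = 64       # GRU hidden size (assumption)
COND = 20    # acoustic feature dimension, e.g. spectral features (assumption)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class GRUCell:
    """Minimal GRU cell with gates computed from [input, state]."""

    def __init__(self, in_dim, hid_dim):
        s = 0.1
        self.Wz = rng.normal(0, s, (hid_dim, in_dim + hid_dim))
        self.Wr = rng.normal(0, s, (hid_dim, in_dim + hid_dim))
        self.Wh = rng.normal(0, s, (hid_dim, in_dim + hid_dim))

    def step(self, x, h):
        xh = np.concatenate([x, h])
        z = sigmoid(self.Wz @ xh)                  # update gate
        r = sigmoid(self.Wr @ xh)                  # reset gate
        h_new = np.tanh(self.Wh @ np.concatenate([x, r * h]))
        return (1 - z) * h + z * h_new


class TwoTierConditionalSampleRNN:
    """Toy two-tier hierarchy: frame-level GRU -> sample-level feed-forward."""

    def __init__(self):
        # Frame tier sees the previous waveform frame plus acoustic features.
        self.frame_rnn = GRUCell(FRAME + COND, H)
        # Sample tier: feed-forward softmax over [previous sample, conditioning].
        self.W_out = rng.normal(0, 0.1, (Q, 1 + H))
        self.h = np.zeros(H)

    def generate_frame(self, prev_frame, acoustic_feats):
        # Frame-level GRU summarizes long-span context and conditioning.
        self.h = self.frame_rnn.step(
            np.concatenate([prev_frame, acoustic_feats]), self.h)
        out = []
        prev = prev_frame[-1]
        for _ in range(FRAME):
            logits = self.W_out @ np.concatenate([[prev], self.h])
            p = np.exp(logits - logits.max())
            p /= p.sum()
            q = rng.choice(Q, p=p)            # sample a quantized amplitude
            prev = (q / (Q - 1)) * 2.0 - 1.0  # map back to [-1, 1]
            out.append(prev)
        return np.array(out)


model = TwoTierConditionalSampleRNN()
frame = np.zeros(FRAME)
feats = rng.normal(size=COND)  # stand-in for one frame of acoustic features
chunks = []
for _ in range(4):             # synthesize 4 frames autoregressively
    frame = model.generate_frame(frame, feats)
    chunks.append(frame)
wav = np.concatenate(chunks)
print(wav.shape)               # (64,)
```

The point of the hierarchy is that the slow frame-level tier carries the long-span dependency between acoustic features and the waveform, while the cheap sample-level layer handles per-sample autoregression.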
