论文信息 - Subband wavenet with overlapped single-sideband filterbanks

Subband wavenet with overlapped single-sideband filterbanks

Compared with conventional vocoders, deep neural network-based raw audio generative models, such as WaveNet and SampleRNN, can more naturally synthesize speech signals, although the synthesis speed is a problem, especially with high sampling frequency. This paper provides subband WaveNet based on multirate signal processing for high-speed and high-quality synthesis with raw audio generative models. In the training stage, speech waveforms are decomposed and decimated into subband short waveforms with a low sampling rate, and each subband WaveNet network is trained using each subband stream. In the synthesis stage, each generated signal is up-sampled and integrated into a fullband speech signal. The results of objective and subjective experiments for unconditional WaveNet with a sampling frequency of 32 kHz indicate that the proposed subband WaveNet with a square-root Hann window-based overlapped 9-channel single-sideband filterbank can realize about four times the synthesis speed and improve the synthesized speech quality more than the conventional fullband WaveNet.

[1] Shuang Xu,et al. First Step Towards End-to-End Parametric TTS Synthesis: Generating Spectral Parameters with Neural Attention , 2016, INTERSPEECH.

[2] Hisashi Kawai,et al. Deep neural network-based power spectrum reconstruction to improve quality of vocoded speech with limited acoustic parameters , 2018 .

[3] Heiga Zen,et al. Fast, Compact, and High Quality LSTM-RNN Based Statistical Parametric Speech Synthesizers for Mobile Devices , 2016, INTERSPEECH.

[4] Samy Bengio,et al. Tacotron: Towards End-to-End Speech Synthesis , 2017, INTERSPEECH.

[5] Heiga Zen,et al. Statistical Parametric Speech Synthesis , 2007, IEEE International Conference on Acoustics, Speech, and Signal Processing.

[6] Yannis Agiomyrgiannakis,et al. Vocaine the vocoder and applications in speech synthesis , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[7] Inma Hernáez,et al. Harmonics Plus Noise Model Based Vocoder for Statistical Parametric Speech Synthesis , 2014, IEEE Journal of Selected Topics in Signal Processing.

[8] Heiga Zen,et al. Speech Synthesis Based on Hidden Markov Models , 2013, Proceedings of the IEEE.

[9] Yoshua Bengio,et al. SampleRNN: An Unconditional End-to-End Neural Audio Generation Model , 2016, ICLR.

[10] Bajibabu Bollepalli,et al. GlottDNN - A Full-Band Glottal Vocoder for Statistical Parametric Speech Synthesis , 2016, INTERSPEECH.

[11] Junichi Yamagishi,et al. A deep auto-encoder based low-dimensional feature extraction from FFT spectral envelopes for statistical parametric speech synthesis , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[12] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.

[13] Alexander Gutkin,et al. Recent Advances in Google Real-Time HMM-Driven Unit Selection Synthesizer , 2016, INTERSPEECH.

[14] Heiga Zen,et al. Directly modeling speech waveforms by neural networks for statistical parametric speech synthesis , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[15] Heiga Zen,et al. Directly modeling voiced and unvoiced components in speech waveforms by neural networks , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[16] Hideki Kawahara,et al. Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds , 1999, Speech Commun..

[17] Tomoki Toda,et al. Speaker-Dependent WaveNet Vocoder , 2017, INTERSPEECH.

[18] Heiga Zen,et al. Deep Learning for Acoustic Modeling in Parametric Speech Generation: A systematic review of existing techniques and future trends , 2015, IEEE Signal Processing Magazine.

[19] Tomoki Toda,et al. Model Integration for HMM- and DNN-Based Speech Synthesis Using Product-of-Experts Framework , 2016, INTERSPEECH.

[20] Keiichi Tokuda,et al. An adaptive algorithm for mel-cepstral analysis of speech , 1992, [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[21] Heiga Zen,et al. WaveNet: A Generative Model for Raw Audio , 2016, SSW.

[22] Adam Coates,et al. Deep Voice: Real-time Neural Text-to-Speech , 2017, ICML.

[23] Masanori Morise,et al. WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications , 2016, IEICE Trans. Inf. Syst..

[24] Yoshua Bengio,et al. Char2Wav: End-to-End Speech Synthesis , 2017, ICLR.

[25] Heiga Zen,et al. Statistical parametric speech synthesis using deep neural networks , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[26] Xin Wang,et al. A Comparative Study of the Performance of HMM, DNN, and RNN based Speech Synthesis Systems Trained on Very Large Speaker-Dependent Corpora , 2016, SSW.

[27] Frank K. Soong,et al. TTS synthesis with bidirectional LSTM based recurrent neural networks , 2014, INTERSPEECH.

[28] Thomas S. Huang,et al. Fast Generation for Convolutional Autoregressive Models , 2017, ICLR.