Paper Information - Speech Super-Resolution Using Parallel WaveNet

Speech Super-Resolution Using Parallel WaveNet

Audio super-resolution is the task of increasing the sampling rate of a given low-resolution (i.e., low sampling rate) audio signal. One of the most popular approaches to audio super-resolution is to minimize the squared Euclidean distance between the reconstructed signal and the high sampling rate signal in a point-wise manner. However, such an approach has intrinsic limitations, such as the regression-to-the-mean problem. In this work, we introduce a novel auto-regressive method for the speech super-resolution task, which uses WaveNet to model the distribution of the target high-resolution signal conditioned on the log-scale mel-spectrogram of the low-resolution signal. As an auto-regressive neural network, WaveNet uses the negative log-likelihood as its objective function, which is much better suited than the Euclidean distance to highly stochastic processes such as speech waveforms. We also train a parallel WaveNet to speed up the generation process to real time. In the experiments, we perform speech super-resolution by increasing the sampling rate from 4 kHz to 16 kHz on the VCTK corpus. The proposed method achieves an improvement of about 2 dB over the baseline deep residual convolutional neural network (CNN) under the Log-Spectral Distance (LSD) metric.
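The abstract reports results under the Log-Spectral Distance (LSD) metric but does not define it. The following is a minimal sketch of one common formulation: the root-mean-square difference of log-power spectra over frequency, averaged over frames. The function names, the frame length of 512, the hop of 256, the Hann window, and the use of a plain `log10` without a decibel scaling factor are all assumptions made for illustration, not the paper's exact evaluation setup.

```python
import numpy as np


def stft_power(x, frame_len=512, hop=256):
    """Power spectrogram |X(t, k)|^2 from Hann-windowed frames (no padding).

    frame_len and hop are illustrative choices, not values from the paper.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack(
        [x[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    # rfft keeps the frame_len // 2 + 1 non-negative frequency bins.
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2


def log_spectral_distance(reference, estimate, eps=1e-10):
    """Mean over frames of the RMS (over frequency) log-power difference.

    Assumed convention: log10 of the power spectrum; other works scale
    this differently, so absolute values are not comparable across papers.
    """
    log_ref = np.log10(stft_power(reference) + eps)
    log_est = np.log10(stft_power(estimate) + eps)
    per_frame = np.sqrt(np.mean((log_ref - log_est) ** 2, axis=1))
    return float(np.mean(per_frame))


# Toy signals at 16 kHz: the "degraded" one lacks the 3 kHz component,
# mimicking missing high-frequency content after bandwidth reduction.
sr = 16000
t = np.arange(sr) / sr
reference = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 3000 * t)
degraded = np.sin(2 * np.pi * 440 * t)

lsd_same = log_spectral_distance(reference, reference)
lsd_degraded = log_spectral_distance(reference, degraded)
print(f"LSD(reference, reference) = {lsd_same:.3f}")
print(f"LSD(reference, degraded)  = {lsd_degraded:.3f}")
```

Lower values indicate a closer spectral match, so an identical signal scores zero and a signal with missing high-frequency content scores higher. Because the metric compares magnitudes only, it is insensitive to phase differences between the two waveforms.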
