High-quality Speech Synthesis Using Super-resolution Mel-Spectrogram

In speech synthesis and speech enhancement systems, melspectrograms need to be precise in acoustic representations. However, the generated spectrograms are over-smooth, that could not produce high quality synthesized speech. Inspired by image-to-image translation, we address this problem by using a learning-based post filter combining Pix2PixHD and ResUnet to reconstruct the mel-spectrograms together with super-resolution. From the resulting super-resolution spectrogram networks, we can generate enhanced spectrograms to produce high quality synthesized speech. Our proposed model achieves improved mean opinion scores (MOS) of 3.71 and 4.01 over baseline results of 3.29 and 3.84, while using vocoder Griffin-Lim and WaveNet, respectively.

[1]  Ryan Prenger,et al.  Waveglow: A Flow-based Generative Network for Speech Synthesis , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[2]  Antonio Bonafonte,et al.  SEGAN: Speech Enhancement Generative Adversarial Network , 2017, INTERSPEECH.

[3]  Yoshua Bengio,et al.  Generative Adversarial Nets , 2014, NIPS.

[4]  Heiga Zen,et al.  WaveNet: A Generative Model for Raw Audio , 2016, SSW.

[5]  Sharon Gannot,et al.  Speech Dereverberation Using Fully Convolutional Networks , 2018, 2018 26th European Signal Processing Conference (EUSIPCO).

[6]  Wei Ping,et al.  ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech , 2018, ICLR.

[7]  Eero P. Simoncelli,et al.  Image quality assessment: from error visibility to structural similarity , 2004, IEEE Transactions on Image Processing.

[8]  Qingjie Liu,et al.  Road Extraction by Deep Residual U-Net , 2017, IEEE Geoscience and Remote Sensing Letters.

[9]  Navdeep Jaitly,et al.  Natural TTS Synthesis by Conditioning Wavenet on MEL Spectrogram Predictions , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[10]  Samy Bengio,et al.  Tacotron: A Fully End-to-End Text-To-Speech Synthesis Model , 2017, ArXiv.

[11]  Jae S. Lim,et al.  Signal estimation from modified short-time Fourier transform , 1983, ICASSP.

[12]  Alexei A. Efros,et al.  Image-to-Image Translation with Conditional Adversarial Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Hirokazu Kameoka,et al.  Generative Adversarial Network-Based Postfilter for STFT Spectrograms , 2017, INTERSPEECH.

[14]  Yang Gao,et al.  Voice Impersonation Using Generative Adversarial Networks , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[15]  Trevor Cohn,et al.  A Unified Neural Architecture for Instrumental Audio Tasks , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[16]  Zheng-Hua Tan,et al.  Conditional Generative Adversarial Networks for Speech Enhancement and Noise-Robust Speaker Verification , 2017, INTERSPEECH.

[17]  Jan Kautz,et al.  High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[18]  Jesper Jensen,et al.  A short-time objective intelligibility measure for time-frequency weighted noisy speech , 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

[19]  Hemant A. Patil,et al.  Time-Frequency Mask-based Speech Enhancement using Convolutional Generative Adversarial Network , 2018, 2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC).

[20]  Yoshua Bengio,et al.  Char2Wav: End-to-End Speech Synthesis , 2017, ICLR.

[21]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.