A Comparison of Recent Waveform Generation and Acoustic Modeling Methods for Neural-Network-Based Speech Synthesis

Recent advances in speech synthesis suggest that limitations such as the lossy nature of the amplitude spectrum under a minimum-phase approximation and the over-smoothing effect in acoustic modeling can be overcome by advanced machine-learning approaches. In this paper, we build a framework in which new vocoding and acoustic-modeling techniques can be fairly compared with conventional approaches by means of a large-scale crowdsourced evaluation. Results on acoustic models showed that both generative adversarial networks and an autoregressive (AR) model outperformed a conventional recurrent network, with the AR model performing best. Evaluation of vocoders using the same AR acoustic model demonstrated that a WaveNet vocoder outperformed classical source-filter-based vocoders. In particular, speech waveforms generated by the combination of the AR acoustic model and the WaveNet vocoder achieved a speech-quality score similar to that of vocoded natural speech.
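To make the "minimum-phase approximation" mentioned above concrete: conventional vocoders keep only the amplitude spectrum and impose a minimum-phase response derived from it, discarding the signal's true phase. A minimal sketch of that reconstruction via the real cepstrum (the standard homomorphic method; function name and details are illustrative, not from the paper):

```python
import numpy as np

def minimum_phase_spectrum(amplitude):
    """Reconstruct a minimum-phase complex spectrum from an amplitude
    spectrum via the real cepstrum (homomorphic method)."""
    n = len(amplitude)
    # Real cepstrum of the log-amplitude spectrum (clamped to avoid log(0)).
    cepstrum = np.fft.ifft(np.log(np.maximum(amplitude, 1e-12))).real
    # Fold the anti-causal part onto the causal part (minimum-phase window).
    window = np.zeros(n)
    window[0] = 1.0
    window[1:(n + 1) // 2] = 2.0
    if n % 2 == 0:
        window[n // 2] = 1.0
    return np.exp(np.fft.fft(cepstrum * window))

# The original phase is lost: only the amplitude survives the round trip.
rng = np.random.default_rng(0)
amp = np.abs(np.fft.fft(rng.standard_normal(64)))
spec = minimum_phase_spectrum(amp)
assert np.allclose(np.abs(spec), amp)
```

Whatever phase the original waveform had, the reconstruction above always carries the minimum-phase response implied by the amplitude; this information loss is one of the limitations that WaveNet-style waveform models avoid by modeling samples directly.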
