论文信息 - Tacotron: Towards End-to-End Speech Synthesis

Tacotron: Towards End-to-End Speech Synthesis

A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module. Building these components often requires extensive domain expertise and may contain brittle design choices. In this paper, we present Tacotron, an end-to-end generative text-to-speech model that synthesizes speech directly from characters. Given pairs, the model can be trained completely from scratch with random initialization. We present several key techniques to make the sequence-to-sequence framework perform well for this challenging task. Tacotron achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness. In addition, since Tacotron generates speech at the frame level, it's substantially faster than sample-level autoregressive methods.

[1] Jae S. Lim,et al. Signal estimation from modified short-time Fourier transform , 1983, ICASSP.

[2] Jae Lim,et al. Signal estimation from modified short-time Fourier transform , 1984 .

[3] Joseph P. Olive,et al. Text-to-speech synthesis , 1995, AT&T Technical Journal.

[4] Heiga Zen,et al. Statistical Parametric Speech Synthesis , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[5] Yoshua Bengio,et al. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling , 2014, ArXiv.

[6] Quoc V. Le,et al. Sequence to Sequence Learning with Neural Networks , 2014, NIPS.

[7] Yannis Agiomyrgiannakis,et al. Vocaine the vocoder and applications in speech synthesis , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[8] Geoffrey E. Hinton,et al. Grammar as a Foreign Language , 2014, NIPS.

[9] Sergey Ioffe,et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[10] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.

[11] Samy Bengio,et al. Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks , 2015, NIPS.

[12] Jürgen Schmidhuber,et al. Highway Networks , 2015, ArXiv.

[13] Yoshua Bengio,et al. Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[14] Heiga Zen,et al. Fast, Compact, and High Quality LSTM-RNN Based Statistical Parametric Speech Synthesizers for Mobile Devices , 2016, INTERSPEECH.

[15] Alexander Gutkin,et al. Recent Advances in Google Real-Time HMM-Driven Unit Selection Synthesizer , 2016, INTERSPEECH.

[16] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).