Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning

We present Deep Voice 3, a fully-convolutional attention-based neural text-to-speech (TTS) system. Deep Voice 3 matches state-of-the-art neural speech synthesis systems in naturalness while training ten times faster. We scale Deep Voice 3 to data set sizes unprecedented for TTS, training on more than eight hundred hours of audio from over two thousand speakers. In addition, we identify common error modes of attention-based speech synthesis networks, demonstrate how to mitigate them, and compare several different waveform synthesis methods. We also describe how to scale inference to ten million queries per day on one single-GPU server.
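The attention error modes mentioned above (skipped or repeated words) show up as non-monotonic jumps in the learned text-to-audio alignment. Below is a minimal NumPy sketch of one way such errors can be suppressed at inference time: masking the raw attention scores so the alignment may only advance within a small window ahead of the previously attended encoder position. The window size, the hard masking, and the function name monotonic_attention_mask are illustrative assumptions for this sketch, not the exact procedure used in Deep Voice 3.

    import numpy as np

    def monotonic_attention_mask(scores, prev_position, window=3):
        # Restrict raw attention scores to a small window ahead of the
        # previously attended encoder position, then renormalize with a
        # softmax. `window` (how far ahead the alignment may jump in one
        # decoder step) is an illustrative choice.
        masked = np.full_like(scores, -np.inf)
        lo = prev_position
        hi = min(prev_position + window + 1, len(scores))
        masked[lo:hi] = scores[lo:hi]
        # Numerically stable softmax over the unmasked positions;
        # exp(-inf) evaluates to 0, so masked positions get zero weight.
        weights = np.exp(masked - masked[lo:hi].max())
        return weights / weights.sum()

    # Without the mask, the spurious peak at position 0 would pull the
    # alignment backwards and make the decoder repeat earlier words; with
    # it, attention can only move forward from the previous position.
    raw_scores = np.array([4.0, 0.5, 1.0, 3.5, 0.2, 0.1])
    print(monotonic_attention_mask(raw_scores, prev_position=2))

In this toy example the masked softmax places nearly all of its weight on position 3, the next plausible step forward, rather than jumping back to position 0.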
