Char2Wav: End-to-End Speech Synthesis

We present Char2Wav, an end-to-end model for speech synthesis. Char2Wav has two components: a reader and a neural vocoder. The reader is an encoderdecoder model with attention. The encoder is a bidirectional recurrent neural network that accepts text or phonemes as inputs, while the decoder is a recurrent neural network (RNN) with attention that produces vocoder acoustic features. Neural vocoder refers to a conditional extension of SampleRNN which generates raw waveform samples from intermediate representations. Unlike traditional models for speech synthesis, Char2Wav learns to produce audio directly from text.

[1]  Alan W. Black,et al.  Unit selection in a concatenative speech synthesis system using a large speech database , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[2]  Yoshua Bengio,et al.  A Neural Probabilistic Language Model , 2003, J. Mach. Learn. Res..

[3]  Heiga Zen,et al.  Statistical Parametric Speech Synthesis , 2007, IEEE International Conference on Acoustics, Speech, and Signal Processing.

[4]  Paul Taylor,et al.  Text-to-Speech Synthesis , 2009 .

[5]  Joaquim Llisterri,et al.  The Corpus DIMEx100: transcription and evaluation , 2010, Lang. Resour. Evaluation.

[6]  Alex Graves,et al.  Practical Variational Inference for Neural Networks , 2011, NIPS.

[7]  Heiga Zen,et al.  Statistical parametric speech synthesis using deep neural networks , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[8]  Alex Graves,et al.  Generating Sequences With Recurrent Neural Networks , 2013, ArXiv.

[9]  Heiga Zen,et al.  Speech Synthesis Based on Hidden Markov Models , 2013, Proceedings of the IEEE.

[10]  Yoshua Bengio,et al.  Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation , 2014, EMNLP.

[11]  Navdeep Jaitly,et al.  Towards End-To-End Speech Recognition with Recurrent Neural Networks , 2014, ICML.

[12]  Alex Graves,et al.  Recurrent Models of Visual Attention , 2014, NIPS.

[13]  Quoc V. Le,et al.  Sequence to Sequence Learning with Neural Networks , 2014, NIPS.

[14]  Simon King,et al.  Measuring a decade of progress in Text-to-Speech , 2014 .

[15]  Heiga Zen,et al.  Statistical parametric speech synthesis: from HMM to LSTM-RNN , 2015 .

[16]  Zhizheng Wu,et al.  A study of speaker adaptation for DNN-based speech synthesis , 2015, INTERSPEECH.

[17]  Yoshua Bengio,et al.  Show, Attend and Tell: Neural Image Caption Generation with Visual Attention , 2015, ICML.

[18]  Yoshua Bengio,et al.  Blocks and Fuel: Frameworks for deep learning , 2015, ArXiv.

[19]  Fuchun Peng,et al.  Grapheme-to-phoneme conversion using Long Short-Term Memory recurrent neural networks , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[20]  Heiga Zen,et al.  Directly modeling speech waveforms by neural networks for statistical parametric speech synthesis , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[21]  Frank K. Soong,et al.  Multi-speaker modeling and speaker adaptation for DNN-based TTS synthesis , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[22]  Heiga Zen,et al.  Deep Learning for Acoustic Modeling in Parametric Speech Generation: A systematic review of existing techniques and future trends , 2015, IEEE Signal Processing Magazine.

[23]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[24]  Yoshua Bengio,et al.  Attention-Based Models for Speech Recognition , 2015, NIPS.

[25]  Geoffrey Zweig,et al.  Sequence-to-sequence neural net models for grapheme-to-phoneme conversion , 2015, INTERSPEECH.

[26]  Samy Bengio,et al.  Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks , 2015, NIPS.

[27]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[28]  Heiga Zen,et al.  Fast, Compact, and High Quality LSTM-RNN Based Statistical Parametric Speech Synthesizers for Mobile Devices , 2016, INTERSPEECH.

[29]  Ole Winther,et al.  Neural Machine Translation with Characters and Hierarchical Encoding , 2016, ArXiv.

[30]  Quoc V. Le,et al.  Listen, attend and spell: A neural network for large vocabulary conversational speech recognition , 2015, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[31]  Heiga Zen,et al.  Directly modeling voiced and unvoiced components in speech waveforms by neural networks , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[32]  John Salvatier,et al.  Theano: A Python framework for fast computation of mathematical expressions , 2016, ArXiv.

[33]  Zhizheng Wu,et al.  Merlin: An Open Source Neural Network Speech Synthesis System , 2016, SSW.

[34]  Heiga Zen,et al.  Multi-Language Multi-Speaker Acoustic Modeling for LSTM-RNN Based Statistical Parametric Speech Synthesis , 2016, INTERSPEECH.

[35]  Masanori Morise,et al.  WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications , 2016, IEICE Trans. Inf. Syst..

[36]  Junichi Yamagishi,et al.  SUPERSEDED - CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit , 2016 .

[37]  Heiga Zen,et al.  WaveNet: A Generative Model for Raw Audio , 2016, SSW.

[38]  Stephen DiVerdi,et al.  Cute: A concatenative method for voice conversion using exemplar-based unit selection , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[39]  Yoshua Bengio,et al.  SampleRNN: An Unconditional End-to-End Neural Audio Generation Model , 2016, ICLR.