The CMU entry to the Blizzard Machine Learning Challenge

This paper describes Carnegie Mellon University's (CMU) entry to the ES-1 sub-task of the Blizzard Machine Learning Speech Synthesis Challenge 2017. The submitted system is a parametric model trained to predict vocoder parameters from linguistic features. The task in this year's challenge was to synthesize speech from children's audiobooks: linguistic and acoustic features were provided by the organizers, and the goal was to find the best-performing model. The paper describes the RNN architectures that were investigated and the final model that was submitted.
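The frame-level mapping described above, from linguistic features to vocoder parameters, can be sketched as a simple recurrent network. This is a minimal illustrative sketch, not the submitted system: the dimensions, the plain tanh recurrence, and the linear output layer are all assumptions standing in for whatever architecture and trained weights the actual entry used.

```python
import numpy as np

rng = np.random.default_rng(0)

IN_DIM = 420   # linguistic features per frame (assumed size)
HID_DIM = 64   # recurrent hidden size (assumed size)
OUT_DIM = 82   # vocoder parameters per frame (assumed size)

# Randomly initialized weights stand in for trained parameters.
W_xh = rng.standard_normal((IN_DIM, HID_DIM)) * 0.01
W_hh = rng.standard_normal((HID_DIM, HID_DIM)) * 0.01
W_hy = rng.standard_normal((HID_DIM, OUT_DIM)) * 0.01
b_h = np.zeros(HID_DIM)
b_y = np.zeros(OUT_DIM)

def predict_vocoder_params(linguistic_frames):
    """Run a tanh RNN over a (T, IN_DIM) sequence of linguistic
    feature frames; return a (T, OUT_DIM) sequence of predicted
    vocoder parameters (one vector per frame)."""
    h = np.zeros(HID_DIM)
    outputs = []
    for x in linguistic_frames:
        h = np.tanh(x @ W_xh + h @ W_hh + b_h)  # recurrent state update
        outputs.append(h @ W_hy + b_y)          # linear readout per frame
    return np.stack(outputs)

frames = rng.standard_normal((10, IN_DIM))  # 10 frames of input features
params = predict_vocoder_params(frames)
print(params.shape)  # (10, 82)
```

In a real parametric TTS pipeline the predicted vectors would then be passed to a vocoder to reconstruct the waveform; variants such as LSTMs or bidirectional RNNs replace the plain recurrence above.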
