Submission from CMU for Blizzard Challenge 2019

In this paper we present the CMU entry to the Blizzard Challenge 2019 speech synthesis evaluation. We begin with a description of the build process for our base voice, and then present the following modifications to it: (1) we investigate the effectiveness of sub-sentence training of acoustic models, aimed at better utilization of the available aligned data; (2) we investigate the applicability of strategic gradient backpropagation to accelerate training; (3) we experiment with iterated dilated convolutions in WaveNet to obtain compact models. Although our current performance leaves considerable room for improvement, we are actively pursuing approaches to strengthen our voice-building framework. We believe we are progressing in the right direction and anticipate a much stronger performance in coming evaluations.
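
To make the first modification concrete: the abstract does not spell out how utterances are divided, but a plausible reading of sub-sentence training is to cut long aligned utterances into phrase-level chunks at long pauses, so that more of the aligned data yields usable, shorter training examples. The sketch below illustrates that idea under our own assumptions; the Utterance layout, the pause label, and the 0.2 s threshold are all hypothetical, not details from the paper.

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    text_units: list    # aligned units, each a (label, start_s, end_s) tuple
    audio_start: float  # chunk start time in seconds
    audio_end: float    # chunk end time in seconds

def split_at_pauses(utt, pause_label="pau", min_pause_s=0.2):
    """Split one aligned utterance into sub-sentence chunks at long pauses,
    so each chunk becomes an independent (shorter) training example."""
    chunks, current = [], []
    for label, start, end in utt.text_units:
        if label == pause_label and (end - start) >= min_pause_s and current:
            chunks.append(current)  # close the chunk; drop the pause itself
            current = []
        else:
            current.append((label, start, end))
    if current:
        chunks.append(current)
    # Each chunk carries its own time span, usable to slice the waveform.
    return [
        Utterance(text_units=c, audio_start=c[0][1], audio_end=c[-1][2])
        for c in chunks
    ]
```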
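
Similarly, "strategic gradient backpropagation" is not defined further in the abstract; one well-known instance of the idea, selective backpropagation, forwards the full batch cheaply and then backpropagates only through the highest-loss examples. A minimal PyTorch sketch, with the function name and keep fraction assumed for illustration:

```python
import torch

def selective_backprop_step(model, inputs, targets, loss_fn, optimizer,
                            keep_frac=0.5):
    """One training step that backpropagates only on the hardest examples.

    loss_fn is assumed to return a per-example loss vector
    (e.g. built with reduction='none').
    """
    # Cheap ranking pass: no autograd graph is built here.
    with torch.no_grad():
        per_example = loss_fn(model(inputs), targets)
    k = max(1, int(keep_frac * inputs.size(0)))
    hard = per_example.topk(k).indices  # indices of the k largest losses

    # Full forward/backward pass, but only on the selected subset.
    optimizer.zero_grad()
    loss = loss_fn(model(inputs[hard]), targets[hard]).mean()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the ranking pass runs under torch.no_grad(), the saving comes from performing the expensive backward pass on only a fraction of each batch.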
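
For the third modification, the idea behind iterated dilated convolutions (following Strubell et al.'s ID-CNNs) is to reuse one small block of dilated convolutions several times, so the receptive field grows without adding parameters the way a conventionally deep WaveNet stack does. A minimal causal PyTorch sketch, with the channel count, dilation pattern, and iteration count chosen arbitrarily for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IteratedDilatedBlock(nn.Module):
    """One small stack of causal dilated convolutions whose weights are
    applied repeatedly ("iterated"), giving a large receptive field with
    far fewer parameters than an equally deep stack of fresh layers."""

    def __init__(self, channels=64, dilations=(1, 2, 4, 8), iterations=4):
        super().__init__()
        self.iterations = iterations
        self.dilations = dilations
        self.convs = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=2, dilation=d)
            for d in dilations
        )

    def forward(self, x):                     # x: (batch, channels, time)
        for _ in range(self.iterations):      # same weights on every pass
            for conv, d in zip(self.convs, self.dilations):
                pad = (conv.kernel_size[0] - 1) * d
                h = conv(F.pad(x, (pad, 0)))  # left-pad to keep it causal
                x = x + torch.tanh(h)         # simple residual; WaveNet's
                                              # gated units would slot in here
        return x
```

With kernel size 2 and dilations (1, 2, 4, 8), each pass widens the receptive field by 15 frames, so four iterations of the same weights cover 61 frames while storing only four convolutions' worth of parameters.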
