Prosody contour prediction with long short-term memory, bi-directional, deep recurrent neural networks

Deep Neural Networks (DNNs) have been shown to provide state-of-the-art performance over other baseline models in the task of predicting prosodic targets from text in a speech-synthesis system. However, prosody prediction can be affected by an interaction of short- and long-term contextual factors that a static model depending on a fixed-size context window can fail to properly capture. In this work, we look at recurrent neural networks (RNNs), which are deep in time and can store state information from an arbitrarily long input history when making a prediction. We show that RNNs provide improved performance over DNNs of comparable size in terms of various objective metrics for a variety of prosodic streams (notably, a relative reduction of about 6% in F0 mean-square error accompanied by a relative increase of about 14% in F0 variance), as well as in terms of perceptual quality assessed through mean-opinion-score listening tests.

Index Terms: speech synthesis, text-to-speech, prosody prediction, recurrent neural networks, deep learning
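To make the architecture concrete, below is a minimal sketch (not the authors' code) of a deep bidirectional LSTM that maps per-frame linguistic feature vectors to a prosodic target such as F0, trained with a mean-square-error objective as described in the abstract. The feature dimension, layer sizes, and sequence lengths are illustrative assumptions, and PyTorch is assumed as the framework.

```python
# Hedged sketch: a stacked ("deep") bidirectional LSTM for frame-level
# prosody regression. All dimensions below are hypothetical.
import torch
import torch.nn as nn

class ProsodyBLSTM(nn.Module):
    def __init__(self, in_dim=300, hidden=128, layers=2, out_dim=1):
        super().__init__()
        # Stacked bidirectional LSTM over the input frame sequence.
        self.rnn = nn.LSTM(in_dim, hidden, num_layers=layers,
                           bidirectional=True, batch_first=True)
        # Linear readout; 2 * hidden because forward and backward
        # hidden states are concatenated at each frame.
        self.out = nn.Linear(2 * hidden, out_dim)

    def forward(self, x):
        # x: (batch, frames, in_dim) linguistic features
        # returns: (batch, frames, out_dim) prosodic predictions
        h, _ = self.rnn(x)
        return self.out(h)

# Usage example: predict one F0 value per frame for 500-frame
# utterances described by 300-dim text-derived features.
model = ProsodyBLSTM()
feats = torch.randn(8, 500, 300)           # dummy linguistic features
f0_pred = model(feats)                     # shape: (8, 500, 1)
f0_target = torch.randn(8, 500, 1)         # dummy F0 targets
loss = nn.MSELoss()(f0_pred, f0_target)    # MSE training objective
```

The bidirectional recurrence is what lets each frame's prediction draw on both past and future context of the utterance, in contrast to a DNN limited to a fixed-size context window.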
