An RNN-Based Quantized F0 Model with Multi-Tier Feedback Links for Text-to-Speech Synthesis

This paper proposes a recurrent-neural-network (RNN)-based F0 model for text-to-speech (TTS) synthesis that generates F0 contours from textual features. Unlike related F0 models, it is designed to learn the temporal correlation of F0 contours at multiple levels. Frame-level correlation is captured by feeding back the F0 output of the previous frame as an additional input to the current frame; correlation over longer time spans is modeled in the same way, but with F0 features aggregated over the phoneme and the syllable. A second difference is that the model outputs not an interpolated, continuous-valued F0 contour but a sequence of discrete symbols: quantized F0 levels plus a symbol for the unvoiced condition. Using discrete F0 symbols lets the model avoid the influence of artificially interpolated F0 curves. In experiments, the proposed model, trained with a dropout strategy, generated smooth F0 contours with better perceived quality than those of the baseline RNN models.
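The two ideas in the abstract, discrete F0 symbols and multi-tier feedback, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The number of levels (127), the F0 range (60 to 500 Hz), the uniform log-frequency grid, the use of symbol 0 for unvoiced frames, and the choice of a mean over voiced frames as the aggregated feature are all assumptions made for illustration. The names `quantize_f0`, `dequantize_f0`, and `feedback_inputs` are hypothetical.

```python
import math

UNVOICED = 0  # assumed symbol index for unvoiced frames


def quantize_f0(f0_hz, n_levels=127, f_min=60.0, f_max=500.0):
    """Map F0 values (Hz, <= 0 meaning unvoiced) to discrete symbols.

    Voiced frames get symbols 1..n_levels on a uniform log-frequency grid
    (an assumed design); unvoiced frames get UNVOICED, so no interpolation
    across unvoiced regions is needed.
    """
    lo, hi = math.log(f_min), math.log(f_max)
    symbols = []
    for f in f0_hz:
        if f <= 0.0:
            symbols.append(UNVOICED)
            continue
        x = min(max(math.log(f), lo), hi)  # clip into the assumed range
        level = round((x - lo) / (hi - lo) * (n_levels - 1))
        symbols.append(1 + level)
    return symbols


def dequantize_f0(symbols, n_levels=127, f_min=60.0, f_max=500.0):
    """Map symbols back to F0 in Hz; unvoiced frames become 0.0."""
    lo, hi = math.log(f_min), math.log(f_max)
    return [
        0.0 if s == UNVOICED
        else math.exp(lo + (s - 1) / (n_levels - 1) * (hi - lo))
        for s in symbols
    ]


def feedback_inputs(symbols, segment_ids):
    """Build per-frame feedback features from past outputs only.

    Returns one (previous-frame symbol, mean voiced symbol of the previous
    segment) pair per frame. A segment can be a phoneme or a syllable; the
    mean is an assumed aggregation, and 0.0 stands for "no voiced history".
    """
    feats = []
    prev_symbol = UNVOICED  # nothing precedes the first frame
    prev_seg_mean = 0.0     # no completed segment yet
    cur_levels = []         # voiced symbols seen in the current segment
    for t, (s, seg) in enumerate(zip(symbols, segment_ids)):
        if t > 0 and seg != segment_ids[t - 1]:
            # Segment boundary: aggregate the segment that just ended.
            prev_seg_mean = sum(cur_levels) / len(cur_levels) if cur_levels else 0.0
            cur_levels = []
        feats.append((prev_symbol, prev_seg_mean))
        if s != UNVOICED:
            cur_levels.append(s)
        prev_symbol = s
    return feats
```

At generation time the same features would be computed from the model's own previous outputs rather than from natural F0. Calling `feedback_inputs` once with phoneme indices and once with syllable indices gives one feedback tier per linguistic level.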
