Deep neural network context embeddings for model selection in rich-context HMM synthesis

This paper introduces a novel form of parametric synthesis that uses context embeddings produced by the bottleneck layer of a deep neural network to guide the selection of models in a rich-context HMM-based synthesiser. Rich-context synthesis, in which Gaussian distributions estimated from single linguistic contexts seen in the training data are used for synthesis in place of the more conventional decision-tree-tied models, was originally proposed to address the over-smoothing caused by averaging across contexts. Our previous investigations have confirmed experimentally that averaging across different contexts is indeed one of the largest factors contributing to the limited quality of statistical parametric speech synthesis. However, a possible weakness of the rich-context approach as previously formulated is that a conventional tied model is still used to guide the selection of Gaussians at synthesis time. Our proposed approach replaces this tied model with context embeddings derived from a neural network.
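
As a rough illustration of embedding-guided selection, the sketch below picks, for a target linguistic context, the rich-context model whose stored embedding lies closest to the target's embedding. The abstract does not specify the selection criterion, so the Euclidean distance, the function names, and the toy two-dimensional embeddings are all assumptions made for illustration only; in practice the embeddings would come from the bottleneck layer of the trained network.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_rich_context_model(target_embedding, candidates):
    """Return the id of the candidate whose embedding is nearest the target.

    candidates: list of (model_id, embedding) pairs, one per rich-context
    model seen in training (hypothetical data layout, not from the paper).
    """
    if not candidates:
        raise ValueError("no candidate models to select from")
    best_id, _ = min(candidates, key=lambda c: euclidean(target_embedding, c[1]))
    return best_id


# Toy embeddings standing in for bottleneck-layer activations (assumed values).
candidates = [
    ("ctx_a", [0.1, 0.9]),
    ("ctx_b", [0.8, 0.2]),
    ("ctx_c", [0.5, 0.5]),
]

selected = select_rich_context_model([0.7, 0.3], candidates)
print(selected)
```

A nearest-neighbour rule is only one possible reading of "guide the selection"; any criterion that ranks candidate Gaussians by their closeness in embedding space would fit the same interface.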
