Deep neural network context embeddings for model selection in rich-context HMM synthesis

This paper introduces a novel form of parametric synthesis that uses context embeddings, produced by the bottleneck layer of a deep neural network, to guide the selection of models in a rich-context HMM-based synthesiser. Rich-context synthesis – in which Gaussian distributions estimated from single linguistic contexts seen in the training data are used for synthesis, rather than the more conventional decision-tree-tied models – was originally proposed to address over-smoothing caused by averaging across contexts. Our previous investigations have confirmed experimentally that averaging across different contexts is indeed one of the largest factors limiting the quality of statistical parametric speech synthesis. However, a possible weakness of the rich-context approach as previously formulated is that a conventional tied model is still used to guide the selection of Gaussians at synthesis time. Our proposed approach replaces this with context embeddings derived from a neural network.
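The core selection step can be illustrated with a minimal sketch. This is an assumption-laden toy example, not the paper's implementation: it assumes each training context has a fixed-length bottleneck embedding and a single-context Gaussian over acoustic features, and it uses Euclidean distance for nearest-neighbour selection (the paper's actual network architecture and distance measure are not reproduced here). All names and dimensions below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: each training context has a bottleneck embedding
# and a single-context ("rich-context") Gaussian over acoustic features.
n_contexts, emb_dim, feat_dim = 100, 32, 5
train_embeddings = rng.normal(size=(n_contexts, emb_dim))
train_means = rng.normal(size=(n_contexts, feat_dim))
train_vars = rng.uniform(0.5, 1.5, size=(n_contexts, feat_dim))

def select_rich_context_model(target_embedding, embeddings):
    """Return the index of the training context whose embedding is
    closest (Euclidean distance) to the target context's embedding."""
    dists = np.linalg.norm(embeddings - target_embedding, axis=1)
    return int(np.argmin(dists))

# At synthesis time: embed the target context (simulated here as a noisy
# copy of a known training embedding) and select its nearest rich-context
# Gaussian, whose mean and variance then drive parameter generation.
target = train_embeddings[42] + 0.01 * rng.normal(size=emb_dim)
idx = select_rich_context_model(target, train_embeddings)
mean, var = train_means[idx], train_vars[idx]
```

The point of the sketch is the replacement of decision-tree traversal by a distance computation in embedding space: the tied model's role of pointing at a suitable Gaussian is taken over by proximity between learned context representations.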
