Decision tree usage for incremental parametric speech synthesis

Human speakers plan and deliver their utterances incrementally, piece by piece, and their choices regarding phonetic details (and the details' peculiarities) are evidently rarely determined by globally optimal solutions. In contrast, parametric speech synthesizers use the full-utterance context when optimizing vocoding parameters and when determining HMM states. Apart from being cognitively implausible, this impedes incremental use cases, where the future context is often at least partially unavailable. This paper investigates the 'locality' of features in parametric speech synthesis voices and takes some missing steps towards better HMM state selection and prosody modelling for incremental speech synthesis.
