A multi-level context-dependent prosodic model applied to durational modeling

In this article we present a multi-level prosodic model based on the estimation of prosodic parameters over a set of well-defined linguistic units. Different linguistic units represent different scales of prosodic variation (local and global forms), so the linguistic factors that explain the variation of prosodic parameters can be estimated independently at each level. The model is applied to syllable-based duration modeling on two read-speech corpora: laboratory speech and acted speech. Compared to a syllable-based baseline model, the proposed approach improves the temporal organization of the predicted durations (correlation score) and reduces model complexity, while showing comparable performance in terms of relative prediction error.

Index Terms: speech synthesis, prosody, multi-level model, context-dependent model.
