Context-dependent additive log F0 model for HMM-based speech synthesis

Abstract

This paper proposes a context-dependent additive acoustic modelling technique and its application to logarithmic fundamental frequency (log F0) modelling for HMM-based speech synthesis. In the proposed technique, mean vectors of state-output distributions are composed as the weighted sum of decision tree-clustered context-dependent bias terms. Its model parameters and decision trees are estimated and built based on the maximum likelihood (ML) criterion. The proposed technique has the potential to capture the additive structure of log F0 contours. A preliminary experiment using a small database showed that the proposed technique yielded encouraging results.

Index Terms: speech synthesis, HMMs, log F0 modelling

1. Introduction

Hidden Markov model (HMM)-based speech synthesis [1] has grown in popularity in recent years. In this framework, the spectrum, excitation, and durations of speech are modelled simultaneously in a unified framework of HMMs. For a given text to be synthesized, speech parameter trajectories that maximise their output probabilities are generated from estimated HMMs under constraints between static and dynamic features [2]. Typical instances of this framework use mel-cepstral coefficients or line spectral pairs for their spectral parameters and log F
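The additive composition of mean vectors described in the abstract can be illustrated with a minimal sketch. It assumes a scalar mean, and every name and number in it is hypothetical: the function `additive_mean`, the two toy "trees" (plain functions standing in for decision trees), the context keys, and the bias and weight values are illustrative only and are not taken from the paper. ML estimation of the parameters and tree building are not shown.

```python
from typing import Callable, Dict, List, Sequence

# A context is a bag of linguistic features (hypothetical keys, for illustration).
Context = Dict[str, str]
# Stand-in for a decision tree: maps a context to the index of a leaf (cluster).
Tree = Callable[[Context], int]


def additive_mean(
    context: Context,
    trees: Sequence[Tree],
    biases: Sequence[List[float]],
    weights: Sequence[float],
) -> float:
    """Compose a state-output mean as a weighted sum of bias terms.

    Each additive component has its own tree; the tree picks one
    context-dependent bias term, which is scaled by the component weight.
    """
    total = 0.0
    for tree, leaf_biases, weight in zip(trees, biases, weights):
        leaf = tree(context)            # cluster selected for this context
        total += weight * leaf_biases[leaf]
    return total


# Two toy components (assumed, not from the paper): a phrase-level
# component and an accent-level component.
def phrase_tree(ctx: Context) -> int:
    return 0 if ctx["phrase_position"] == "initial" else 1


def accent_tree(ctx: Context) -> int:
    return 0 if ctx["accented"] == "yes" else 1


trees = [phrase_tree, accent_tree]
biases = [[5.0, 4.5], [0.25, 0.0]]   # illustrative log F0 bias terms per leaf
weights = [1.0, 1.0]                 # illustrative component weights

ctx = {"phrase_position": "initial", "accented": "yes"}
print(additive_mean(ctx, trees, biases, weights))  # 5.25
```

Because each component is clustered by its own tree, the number of distinct means that can be expressed grows with the product of the leaf counts, while the number of stored bias terms grows only with their sum.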

[1] Keiichi Tokuda, et al. Speech parameter generation algorithms for HMM-based speech synthesis, 2000, IEEE International Conference on Acoustics, Speech, and Signal Processing.

[2] Heiga Zen, et al. Tying covariance matrices to reduce the footprint of HMM-based speech synthesis systems, 2009, INTERSPEECH.

[3] J. J. Odell, et al. The Use of Context in Large Vocabulary Speech Recognition, 1995.

[4] Yoshinori Sagisaka, et al. Statistical modelling of speech segment duration by constrained tree regression, 2000.

[5] Mark J. F. Gales. Cluster adaptive training of hidden Markov models, 2000, IEEE Trans. Speech Audio Process.

[6] Hiroya Fujisaki, et al. In search of models in speech communication research, 2009, INTERSPEECH.

[7] Keiichi Tokuda, et al. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis, 1999, EUROSPEECH.

[8] Heiga Zen, et al. Acoustic modeling with contextual additive structure for HMM-based speech recognition, 2008, IEEE International Conference on Acoustics, Speech and Signal Processing.

[9] Shinsuke Sakai. F0 modeling with multi-layer additive modeling based on a statistical learning technique, 2004, SSW.

[10] Keiichi Tokuda, et al. A Speech Parameter Generation Algorithm Considering Global Variance for HMM-Based Speech Synthesis, 2007, IEICE Trans. Inf. Syst.

[11] Keiichi Tokuda, et al. Multi-Space Probability Distribution HMM, 2002.

[12] M. Saunders, et al. Solution of Sparse Indefinite Systems of Linear Equations, 1975.

[13] Ren-Hua Wang, et al. Articulatory control of HMM-based parametric speech synthesis driven by phonetic knowledge, 2008, INTERSPEECH.

[14] Edward I. George, et al. Bayesian Ensemble Learning, 2006, NIPS.

[15] J. Friedman. Stochastic gradient boosting, 2002.

[16] Frank K. Soong, et al. Generating natural F0 trajectory with additive trees, 2008, INTERSPEECH.

[17] Heiga Zen, et al. Details of the Nitech HMM-Based Speech Synthesis System for the Blizzard Challenge 2005, 2007, IEICE Trans. Inf. Syst.