Modulation spectrum-constrained trajectory training algorithm for HMM-based speech synthesis

This paper presents a novel training algorithm for Hidden Markov Model (HMM)-based speech synthesis. One of the biggest causes of quality degradation in synthetic speech is the over-smoothing effect often observed in generated speech parameter trajectories. We recently found that the Modulation Spectrum (MS) of the generated speech parameters is sensitive to the over-smoothing effect, and proposed a parameter generation algorithm that considers the MS. That algorithm effectively alleviates over-smoothing, but it sacrifices the computationally efficient generation processing of the conventional generation algorithm. In this paper, the MS is instead integrated into the training stage rather than the parameter generation stage, in a manner similar to our previous work on Gaussian Mixture Model (GMM)-based spectral parameter trajectory conversion. The trajectory HMM is trained with a novel objective function consisting of both the conventional trajectory HMM likelihood and a newly introduced MS likelihood. This training framework is further extended to the F0 component. The experimental results demonstrate that the proposed algorithm improves synthetic speech quality while preserving the computationally efficient generation processing.
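To make the idea of an MS term in the objective concrete, the following is a minimal sketch, not the authors' implementation. It assumes the MS is the log power spectrum of each parameter dimension taken along the time axis, that the MS likelihood is a diagonal Gaussian, and that the two log-likelihoods are combined with a scalar weight. The function names, the FFT length, the weight, and the toy data are all hypothetical.

```python
import numpy as np

def modulation_spectrum(traj, n_fft=64):
    """Log power spectrum of each parameter dimension along time.

    traj: (T, D) array of speech parameters (T frames, D dimensions).
    Returns an (n_fft // 2 + 1, D) array. Assumed definition of the MS.
    """
    spec = np.fft.rfft(traj, n=n_fft, axis=0)
    power = np.abs(spec) ** 2
    return np.log(power + 1e-10)  # small floor avoids log(0)

def ms_log_likelihood(ms, mean, var):
    """Diagonal-Gaussian log-likelihood of an MS (assumed model form)."""
    diff = ms - mean
    return float(-0.5 * np.sum(np.log(2.0 * np.pi * var) + diff ** 2 / var))

def combined_objective(traj_loglik, ms_loglik, weight=1.0):
    """Trajectory log-likelihood plus a weighted MS log-likelihood.

    The weight is a hypothetical tuning parameter for this sketch.
    """
    return traj_loglik + weight * ms_loglik

# Toy demonstration: a random "natural" trajectory and an over-smoothed copy.
rng = np.random.default_rng(0)
natural = rng.standard_normal((64, 3))
kernel = np.ones(5) / 5.0  # 5-frame moving average mimics over-smoothing
smoothed = np.apply_along_axis(
    lambda x: np.convolve(x, kernel, mode="same"), 0, natural
)

ms_nat = modulation_spectrum(natural)
ms_smooth = modulation_spectrum(smoothed)

# Hypothetical MS statistics: mean taken from the natural trajectory, unit variance.
ms_mean, ms_var = ms_nat, np.ones_like(ms_nat)
ll_nat = ms_log_likelihood(ms_nat, ms_mean, ms_var)
ll_smooth = ms_log_likelihood(ms_smooth, ms_mean, ms_var)
```

In this toy setting the smoothed trajectory scores a lower MS log-likelihood than the natural one, which is the kind of penalty an MS term would contribute during training. Because the term appears only in the training objective, generation from the trained model can still use the conventional algorithm.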
