Average-Voice-Based Speech Synthesis

This thesis describes a novel speech synthesis framework, "Average-Voice-Based Speech Synthesis." With this framework, synthetic speech of an arbitrary target speaker can be obtained robustly and stably even when only a very small amount of speech data is available for that speaker. The framework consists of a speaker normalization algorithm for parameter clustering, a speaker normalization algorithm for parameter estimation, a transformation/adaptation part, and a part that modifies the rough transformation.

In decision-tree-based context clustering for the average voice model, the nodes of the decision tree do not always have training data from all speakers, and some nodes have data from only one speaker. Such speaker-biased nodes degrade the quality of the average voice and of synthetic speech after speaker adaptation, especially in prosody. We therefore first propose a new context clustering technique, "shared-decision-tree-based context clustering," to overcome this problem. With this technique, every node of the decision tree always has training data from all speakers in the training speech database. As a result, we can construct a decision tree common to all training speakers, and the distribution at each node always reflects the statistics of all speakers.

However, when the amount of training data differs widely among the training speakers, the node distributions are often biased toward particular speakers and/or genders, which degrades the quality of synthetic speech. We therefore incorporate "speaker adaptive training" into the parameter estimation procedure of the average voice model to reduce the influence of speaker dependence. In speaker adaptive training, the difference between a training speaker's voice and the average voice is assumed to be expressed as a simple linear regression function of the mean vector of each distribution, and a canonical average voice model is estimated under this assumption.

In speaker adaptation for speech synthesis, it is desirable to convert not only voice characteristics but also prosodic features such as F0 and phone duration. We therefore adopt the framework of the "hidden semi-Markov model" (HSMM), an HMM with explicit state duration distributions, and propose an HSMM-based model adaptation algorithm that simultaneously transforms both state output and state duration distributions. Furthermore, we also propose an HSMM-based speaker adaptive training algorithm that normalizes both state output and state duration distributions of the average voice model at the same time. Finally, we explore several speaker adaptation algorithms to transform the average voice more effectively …
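The abstract describes shared-decision-tree-based context clustering only in prose; the sketch below illustrates the key constraint it adds over conventional clustering. The data structures (`samples` as speaker/context pairs, a boolean `question` over contexts) are illustrative assumptions, not the thesis's actual implementation.

```python
from typing import Callable, Iterable, Set, Tuple

def split_is_shared(samples: Iterable[Tuple[str, dict]],
                    question: Callable[[dict], bool],
                    all_speakers: Set[str]) -> bool:
    """Return True if `question` splits the node so that BOTH children
    still contain training data from every speaker in `all_speakers`."""
    yes_speakers, no_speakers = set(), set()
    for speaker, context in samples:
        (yes_speakers if question(context) else no_speakers).add(speaker)
    # Conventional clustering checks only the likelihood gain of a split;
    # the shared-tree criterion additionally requires full speaker
    # coverage on both sides before the question may be selected.
    return yes_speakers == all_speakers and no_speakers == all_speakers

# Hypothetical usage: contexts carry a phonetic flag; three training speakers.
samples = [("spk1", {"vowel": True}), ("spk1", {"vowel": False}),
           ("spk2", {"vowel": True}), ("spk2", {"vowel": False}),
           ("spk3", {"vowel": True})]
ok = split_is_shared(samples, lambda c: c["vowel"], {"spk1", "spk2", "spk3"})
print(ok)  # False: the "no" child has no data from spk3
```

This coverage check is what guarantees the property stated above: every node of the resulting tree, and hence every node distribution, reflects statistics from all training speakers.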

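As a rough illustration of the adaptation framework summarized above, the following sketch applies an MLLR-style linear-regression transform to the mean vectors of both a state output distribution and a state duration distribution of an HSMM, which is the form of transform the proposed HSMM-based adaptation uses; the same linear-regression assumption underlies the speaker adaptive training step. All dimensions, variable names, and numeric values are hypothetical, and the maximum-likelihood estimation of the transforms from adaptation data is omitted.

```python
import numpy as np

def adapt_output_mean(W: np.ndarray, mu: np.ndarray) -> np.ndarray:
    """Apply an MLLR mean transform W (D x (D+1)) to an output mean mu (D,)."""
    xi = np.concatenate(([1.0], mu))  # extended mean vector [1, mu^T]^T
    return W @ xi

def adapt_duration_mean(chi: float, nu: float, m: float) -> float:
    """Apply a scalar linear regression (chi, nu) to a state duration mean m."""
    return chi * m + nu

# Hypothetical numbers: a 25-dimensional spectral-feature output mean and a
# Gaussian duration mean of 8 frames for one state of the average voice model.
rng = np.random.default_rng(0)
mu_avg = rng.standard_normal(25)                     # average-voice output mean
W = np.hstack([0.01 * rng.standard_normal((25, 1)),  # bias column b
               np.eye(25)])                          # near-identity rotation A
mu_adapted = adapt_output_mean(W, mu_avg)            # mu' = A mu + b
dur_adapted = adapt_duration_mean(1.1, 0.5, 8.0)     # target speaks more slowly
print(mu_adapted.shape, dur_adapted)                 # (25,) 9.3
```

Because the duration distributions are transformed alongside the output distributions, a single adaptation pass can shift voice characteristics, F0 statistics, and speaking rate toward the target speaker at the same time.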