HMM-based speech synthesis and its applications

This thesis describes a novel approach to text-to-speech synthesis (TTS) based on hidden Markov models (HMMs). Several attempts to use HMMs for constructing TTS systems have been proposed, and most such systems are based on waveform concatenation techniques. In the proposed approach, in contrast, speech parameter sequences are generated directly from HMMs based on the maximum likelihood criterion. By considering the relationship between static and dynamic parameters, smooth spectral sequences are generated according to the statistics of the static and dynamic parameters modeled by the HMMs. As a result, natural-sounding speech can be synthesized. Subjective experimental results demonstrate the effectiveness of using dynamic features. The relationship between model complexity and synthesized speech quality is also investigated. To synthesize speech, fundamental frequency (F0) patterns must also be modeled and generated. Conventional discrete or continuous HMMs, however, cannot be applied to modeling F0 patterns, since observation sequences of F0 patterns are composed of one-dimensional continuous values and a discrete symbol that represents "unvoiced." To overcome this problem, the HMM is extended so that it can model a sequence of observation vectors with variable dimensionality, including zero-dimensional observations, i.e., discrete symbols. It is shown that by using this extended HMM, referred to as the multi-space probability distribution HMM (MSD-HMM), spectral parameter sequences and F0 patterns can be modeled and generated in a unified HMM framework. Since speech parameter sequences are generated directly from HMMs, it is possible to convert the voice characteristics of synthetic speech to those of a given target speaker by applying speaker adaptation techniques proposed in the speech recognition area.
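The generation step summarized above can be illustrated with a small sketch. It is not the thesis implementation: it assumes the state sequence is already fixed, so each frame has a known Gaussian mean and variance for one static feature and its delta, and it assumes a delta window of 0.5 * (c[t+1] - c[t-1]) with indices clamped at the sequence edges. Under those assumptions, maximizing the likelihood of the stacked static and delta observations W c reduces to solving the linear system (W' P W) c = W' P mu, where P holds the inverse variances. The function names are hypothetical.

```python
def delta_window_matrix(num_frames):
    """Build W (2T x T): row 2t copies c[t], row 2t+1 is 0.5*(c[t+1] - c[t-1])."""
    rows = []
    for t in range(num_frames):
        static_row = [0.0] * num_frames
        static_row[t] = 1.0
        delta_row = [0.0] * num_frames
        prev_t = max(t - 1, 0)                 # assumed edge handling: clamp indices
        next_t = min(t + 1, num_frames - 1)
        delta_row[next_t] += 0.5
        delta_row[prev_t] -= 0.5
        rows.append(static_row)
        rows.append(delta_row)
    return rows


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def generate_trajectory(static_mean, static_var, delta_mean, delta_var):
    """Return the static sequence c that maximizes the Gaussian likelihood of W c."""
    num_frames = len(static_mean)
    w = delta_window_matrix(num_frames)
    mu, precision = [], []
    for t in range(num_frames):
        mu += [static_mean[t], delta_mean[t]]
        precision += [1.0 / static_var[t], 1.0 / delta_var[t]]
    n_rows = 2 * num_frames
    # Normal equations: (W' P W) c = W' P mu
    a = [[sum(w[r][i] * precision[r] * w[r][j] for r in range(n_rows))
          for j in range(num_frames)] for i in range(num_frames)]
    b = [sum(w[r][i] * precision[r] * mu[r] for r in range(n_rows))
         for i in range(num_frames)]
    return solve_linear(a, b)


if __name__ == "__main__":
    # Two "states" with static means 0 and 1; delta means of 0 favor a flat contour.
    step = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
    print(generate_trajectory(step, [1.0] * 6, [0.0] * 6, [0.1] * 6))
```

With tight delta variances the step between the two state means is smoothed into a gradual transition. With very loose delta variances the output falls back to the piecewise-constant static means, which is the behavior the dynamic features are meant to avoid.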
In this thesis, the MAP-VFS algorithm, which is a combination of maximum a posteriori (MAP) estimation and a vector field smoothing (VFS) technique, is applied to the HMM-based TTS system. Results of ABX listening tests averaged over four target speakers (two males and two females) show that speech samples synthesized from the adapted models were judged to be closer to the target speakers than samples from the initial speaker-independent models at a rate of 88%, using only one adaptation sentence from each target speaker. Since it has been shown that the HMM-based speech synthesis system is able to synthesize speech from arbitrarily given text with a given speaker's voice characteristics, the HMM-based TTS system can be considered applicable to imposture against speaker verification systems. From this point of view, the security of speaker verification systems against synthetic speech is investigated. Experimental …
