An HMM Based Pitch-Contour Generation Method for Mandarin Speech Synthesis

In this paper, a method is proposed to generate pitch-contours for Mandarin speech synthesis. In this method, an HMM (hidden Markov model) is used to model the hidden prosodic states of a sentence, and a syllable's pitch-contour is treated as an observation generated from a prosodic state. Such an HMM is called a syllable pitch-contour HMM (SPC-HMM). For training the SPC-HMM, we developed a practical method to normalize a pitch-contour's height. After normalization, each training syllable's pitch-contour is vector quantized and represented with a VQ (vector quantization) code. Then, the VQ code and the lexical tones of the adjacent syllables are combined to define an observation symbol for training the SPC-HMM. In the synthesis phase, the most probable sentence-wide observation symbol sequence is searched for on the SPC-HMM using a dynamic programming algorithm proposed here. Then, the observation symbol found for each syllable is decoded to obtain its pitch-contour VQ code. We conducted experiments to determine the size of the pitch-contour codebook and the number of states for an SPC-HMM. The results indicate that a codebook size of eight and six states are the best choices. We also conducted perception tests to compare the naturalness of synthetic speech files. The results show that the two generation modes studied here for operating an SPC-HMM are comparable to each other in naturalness.
