A study on the consistency analysis of energy parameter for Mandarin speech

This study presents a consistency analysis of the energy parameter for Mandarin speech. The consistency, identified by inspecting the human pronunciation process, can be interpreted as a high correlation between the warping curves of the spectrum and the prosody within a syllable. The analysis proceeds in three steps. First, a hidden Markov model (HMM) is used to decode the HMM-state sequence within each syllable and, at the same time, to divide it into three segments. Second, for a designated syllable, vector quantization (VQ) with the Linde–Buzo–Gray algorithm is used to train a VQ codebook for each segment. Third, the energy vector of each segment is encoded as an index by the VQ codebooks, and the probability of each possible path is then evaluated as the basis for analyzing the consistency. Experiments demonstrate that consistency is obtained when the syllable occurs in exactly the same word. These results point to a research direction: the energy warping process within a syllable must be considered in a text-to-speech system to improve the quality of synthesized speech.
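
The second and third steps can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: the HMM segmentation of the first step is taken as already done, so each syllable token arrives as three energy vectors; a "path" is read here as the triple of codebook indices across the three segments; and the names `lbg_codebook`, `encode`, `path_probabilities`, and the toy data are hypothetical. The codebook is trained with a basic Linde–Buzo–Gray splitting procedure, and path probabilities are estimated as relative frequencies.

```python
from collections import Counter


def _dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def encode(vec, codebook):
    """Return the index of the codeword nearest to vec."""
    return min(range(len(codebook)), key=lambda i: _dist(vec, codebook[i]))


def lbg_codebook(vectors, size, eps=0.01, iters=20):
    """Train a codebook by LBG splitting (size assumed to be a power of two)."""
    codebook = [_mean(vectors)]
    while len(codebook) < size:
        # Split every codeword into a slightly perturbed pair.
        codebook = [[c + s for c in cw] for cw in codebook for s in (eps, -eps)]
        for _ in range(iters):
            # Assign each training vector to its nearest codeword.
            cells = [[] for _ in codebook]
            for v in vectors:
                cells[encode(v, codebook)].append(v)
            # Move each codeword to its cell centroid; keep it if the cell is empty.
            codebook = [_mean(cell) if cell else cw
                        for cell, cw in zip(cells, codebook)]
    return codebook


def path_probabilities(tokens, codebooks):
    """Relative frequency of each index path over all syllable tokens."""
    paths = Counter(
        tuple(encode(seg, cb) for seg, cb in zip(tok, codebooks))
        for tok in tokens
    )
    total = sum(paths.values())
    return {path: count / total for path, count in paths.items()}


# Toy data (assumption): 8 tokens of one syllable, each with three
# 2-dimensional energy vectors, one per segment.
tokens = []
for i in range(6):  # dominant energy contour
    j = 0.01 * i
    tokens.append([[1 + j, 1.0], [2 + j, 2.0], [3 + j, 3.0]])
for i in range(2):  # minority energy contour
    j = 0.01 * i
    tokens.append([[3 + j, 3.0], [1 + j, 1.0], [2 + j, 2.0]])

# One codebook per segment, trained on that segment's vectors only.
codebooks = [lbg_codebook([tok[k] for tok in tokens], size=2) for k in range(3)]
probs = path_probabilities(tokens, codebooks)
for path, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(path, p)
```

In this reading, a probability mass concentrated on a single path would indicate that tokens of the syllable share one energy contour, which is one plausible way to quantify the consistency the abstract describes.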
