Paper information - of Structure to Speech Conversion Using Iterative Optimization

of Structure to Speech Conversion Using Iterative Optimization

This paper describes an improved method for the structure to speech conversion framework that we proposed previously. Most speech synthesizers take a phoneme sequence as input and generate speech by converting each phoneme into its corresponding sound. In other words, they simulate the human process of reading text aloud. Infants, however, usually acquire the ability to communicate by speech without text or phoneme sequences. Because their phonemic awareness is very immature, they can hardly decompose an utterance into a sequence of phones or phonemes. According to developmental psychology, infants in this situation acquire the holistic sound pattern of a word, called the word Gestalt, from the utterances of their parents, and they reproduce it with their own vocal tubes. This behavior is called vocal imitation. In our previous studies, we defined the word Gestalt physically and proposed a method for extracting it from an utterance. We have already applied the word Gestalt to ASR, CALL, and speech generation; we call the last of these structure to speech conversion. Unlike a reading machine, our framework simulates infants’ vocal imitation. In this paper, we propose and evaluate a method that improves our speech generation framework by means of iterative optimization.

Y. Qiao | D. Saito | K. Hirose | N. Minematsu
