Slovenian Text-to-Speech Synthesis for Speech User Interfaces

The paper presents the design concept of a unit-selection text-to-speech synthesis system for the Slovenian language. Owing to its modular and upgradable architecture, the system can be used in a variety of speech user interface applications, ranging from carrier-grade server voice portals and desktop user interfaces to specialized embedded devices. Because memory and processing power requirements are important factors for implementation on embedded devices, the lexica and speech corpora need to be reduced. We describe a simple and efficient implementation of a greedy subset selection algorithm that extracts a compact, high-coverage subset of text sentences. An experiment on a reference text corpus showed that the algorithm produced a compact sentence subset with low redundancy. The adequacy of the spoken output was evaluated with several subjective tests recommended by the International Telecommunication Union (ITU).

Keywords: text-to-speech synthesis, prosody modeling, speech user interface.
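
The abstract does not specify which units the greedy selection covers or how candidates are scored, so the following is only a minimal sketch of a generic greedy set-cover selection under stated assumptions. It assumes each sentence is already a phonetic string with one character per phone, uses diphones (adjacent phone pairs, with `#` as a hypothetical boundary marker) as the coverage units, and scores a sentence by how many not-yet-covered units it adds, preferring the shorter sentence on ties. The names `diphones` and `greedy_subset` are hypothetical and do not come from the paper.

```python
from typing import Dict, List, Optional, Set


def diphones(phones: str) -> Set[str]:
    """Hypothetical unit extractor (an assumption, not the paper's units):
    treats each character as one phone and returns the set of adjacent
    phone pairs, with '#' marking the sentence boundaries."""
    padded = "#" + phones + "#"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def greedy_subset(sentences: List[str]) -> List[int]:
    """Greedy set-cover selection: repeatedly pick the sentence that adds
    the most not-yet-covered units (ties go to the shorter sentence) until
    no remaining sentence adds anything new. Returns the indices of the
    selected sentences in the order they were chosen."""
    units: Dict[int, Set[str]] = {i: diphones(s) for i, s in enumerate(sentences)}
    covered: Set[str] = set()
    selected: List[int] = []
    while True:
        best: Optional[int] = None
        best_gain = 0
        for i, u in units.items():
            gain = len(u - covered)  # number of units this sentence would add
            if gain > best_gain or (
                gain == best_gain and gain > 0 and len(sentences[i]) < len(sentences[best])
            ):
                best, best_gain = i, gain
        if best is None:  # every remaining sentence is fully redundant
            break
        covered |= units.pop(best)
        selected.append(best)
    return selected


if __name__ == "__main__":
    corpus = ["ab", "abcab", "xy"]
    print(greedy_subset(corpus))
```

In this toy corpus, every diphone of `"ab"` already occurs in `"abcab"`, so the sketch selects only indices 1 and 2 and drops the redundant sentence. This plain version rescans all candidates on every iteration; on a large corpus, lazy re-evaluation with a priority queue is a common speedup, but whether the paper uses it is not stated.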