Multilingual Speech Synthesis

This chapter discusses the issues involved in creating and using speech output in multiple languages (that is, multilingual speech synthesis) and describes some of the current technologies for building synthetic voices in new languages. It presents the basic steps involved in building a synthesizer for a new language: defining a phone set, defining a lexicon, designing a database to record, recording the database, building the synthesizer, normalizing text, creating prosodic models, evaluating and tuning the system, and addressing language-specific issues. Widely available tools, such as those provided in the FestVox suite, have helped to increase the number of experts trained in speech synthesis and have thus paved the way for successful research and commercial systems. For waveform synthesis, concatenative synthesis is the easiest technique and produces high-quality output. There are two fundamental techniques in concatenative synthesis: diphone synthesis and unit selection. Diphone synthesis follows the observation that phone boundaries are the most dynamic portions of the acoustic signal and thus the least appropriate places for joining units; diphone units therefore run from the middle of one phone to the middle of the next. Unit selection speech synthesis is based on the concatenation of appropriate sub-word units selected from a database of natural speech. The chapter is complemented by a description of large evaluation efforts across languages.
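
As a rough illustration of the unit selection idea described above, the following sketch picks one recorded unit per target phone so that the summed cost is minimal. It is not the FestVox implementation. The names Unit, target_cost, join_cost, and select_units are hypothetical, and using mean pitch as the only feature is an assumption made to keep the example small. Real systems use much richer acoustic and linguistic features.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """A recorded sub-word unit (hypothetical, toy representation)."""
    phone: str     # phone label of the unit
    pitch: float   # mean F0 in Hz, the only feature used in this sketch
    unit_id: int   # index into the (imaginary) recorded database


def target_cost(wanted_pitch: float, unit: Unit) -> float:
    # Assumption: mismatch between the desired and the recorded pitch.
    return abs(wanted_pitch - unit.pitch)


def join_cost(prev: Unit, nxt: Unit) -> float:
    # Assumption: pitch discontinuity at the concatenation point.
    return abs(prev.pitch - nxt.pitch)


def select_units(targets, database):
    """Return (units, total_cost) for a list of (phone, wanted_pitch) targets.

    Dynamic programming (Viterbi search) over the candidate units for
    each target position.
    """
    if not targets:
        return [], 0.0

    # Candidate units per target position: all units with a matching phone.
    candidates = []
    for phone, _ in targets:
        cands = [u for u in database if u.phone == phone]
        if not cands:
            raise ValueError(f"no unit in database for phone {phone!r}")
        candidates.append(cands)

    # best[i][j] = (cheapest cumulative cost ending in candidate j, backpointer)
    best = [[(target_cost(targets[0][1], u), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            cost, back = min(
                (best[i - 1][k][0] + join_cost(p, u), k)
                for k, p in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(targets[i][1], u), back))
        best.append(row)

    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total


# Toy database with two recordings of each phone.
DATABASE = [
    Unit("h", 100.0, 0),
    Unit("h", 120.0, 1),
    Unit("ay", 118.0, 2),
    Unit("ay", 100.0, 3),
]

units, cost = select_units([("h", 120.0), ("ay", 120.0)], DATABASE)
print([u.unit_id for u in units], cost)
```

The search trades off how well each unit matches its target against how smoothly neighboring units join, which is why a database with many variants of each unit tends to give more natural output than a single stored diphone per phone pair.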