Integrating Language Generation with Speech Synthesis in a Concept to Speech System

Concept-To-Speech (CTS) systems are closely related to two other types of systems: Natural Language Generation (NLG) and Speech Synthesis (SS). In this paper, we propose a new architecture for a CTS system. A Speech Integrating Markup Language (SIML) is designed as a general interface for integrating NLG and SS. We also present a CTS system for a multimedia presentation generation application. We discuss how to extend the current CTS system based on the new architecture. Currently, only limited semantic, syntactic and prosodic features are covered in our prototype system.

1 Introduction

Currently, there are two ways to develop a Concept-To-Speech (CTS) system. The first is to design a monolithic CTS system for a specific application. This design involves a specific NLG module and an SS module, often developed for the application, where discourse, semantic and syntactic information produced by the NLG module can be used directly by CTS algorithms to determine either system-specific parameters for a Text-To-Speech (TTS) system, or phonological parameters for a vocal tract model (e.g., (Young and Fallside, 1979)). One advantage of this design is its efficiency, but features from the two systems are usually so intertwined that the interface of the CTS algorithms is system dependent.

The second is to keep NLG and SS as independent as possible, thus allowing reuse of current NLG tools and TTS systems for other applications. The typical design is equivalent to "NLG plus TTS", where the common interface between NLG and TTS is plain text. The advantage of this design is its simplicity and adaptability: no change is necessary for existing NLG tools and TTS systems. However, it suffers from a serious problem in that it loses useful information. All discourse, semantic and syntactic information is lost when the internal representation of NLG is converted to text output, and this information could clearly be useful in determining prosody.
In this paper, we want to maintain the autonomy of NLG and SS so that they remain reusable for different applications, while making them flexible enough to be integrated easily without losing useful information. We propose a new architecture in which the common interface is not plain text, but a Speech Integrating Markup Language (SIML). We show how this architecture can be used in a multimedia presentation application, for which a prototype SIML was designed.
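
To make the contrast between the two interfaces concrete, the following is a minimal sketch, not the SIML defined in this paper. It assumes a hypothetical NLG output made of phrases that carry a few features, and compares a plain-text interface with a markup interface. The element names (`utterance`, `phrase`), the attribute names (`cat`, `role`, `info`) and the example sentence are all illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List
import xml.etree.ElementTree as ET


@dataclass
class Constituent:
    """One phrase from a hypothetical NLG module, with the features it knows."""
    words: str
    features: Dict[str, str] = field(default_factory=dict)


def to_plain_text(sentence: List[Constituent]) -> str:
    """Plain-text interface ("NLG plus TTS"): only the word string survives."""
    return " ".join(c.words for c in sentence)


def to_markup(sentence: List[Constituent]) -> str:
    """Markup interface: each phrase keeps its features as attributes.

    The tag and attribute names are assumptions made for illustration;
    they are not taken from the SIML specification.
    """
    root = ET.Element("utterance")
    for c in sentence:
        phrase = ET.SubElement(root, "phrase", attrib=c.features)
        phrase.text = c.words
    return ET.tostring(root, encoding="unicode")


# Hypothetical NLG output: syntactic category, semantic role, discourse status.
sentence = [
    Constituent("The meeting", {"cat": "np", "role": "theme", "info": "given"}),
    Constituent("starts", {"cat": "verb"}),
    Constituent("at noon", {"cat": "pp", "role": "time", "info": "new"}),
]

plain = to_plain_text(sentence)   # features are discarded here
marked = to_markup(sentence)      # features remain available downstream
```

With the plain-text interface, a speech synthesizer would have to re-derive from the word string which phrase carries new information. With the markup interface, that information is passed through and could, for example, inform accent placement, while the NLG and SS components still communicate only through a shared document format.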