Integrating Language Generation with Speech Synthesis in a Concept to Speech System

Concept-to-Speech (CTS) systems are closely related to two other types of systems: Natural Language Generation (NLG) and Speech Synthesis (SS). In this paper, we propose a new architecture for a CTS system. A Speech Integrating Markup Language (SIML) is designed as a general interface for integrating NLG and SS. We also present a CTS system for a multimedia presentation generation application. We discuss how to extend the current CTS system based on the new architecture. Currently, only limited semantic, syntactic and prosodic features are covered in our prototype system.

1 Introduction

Currently, there are two ways to develop a Concept-to-Speech (CTS) system. The first is to design a monolithic CTS system for a specific application. This design involves a specific NLG module and an SS module, often developed for the application, where discourse, semantic and syntactic information produced by the NLG module can be used directly by CTS algorithms to determine either system-specific parameters for a Text-to-Speech system, or phonological parameters for a vocal tract model (e.g., Young and Fallside, 1979). One advantage of this design is its efficiency, but features from the two systems are usually so intertwined that the interface of the CTS algorithms is system dependent.

The second design keeps NLG and SS as independent as possible, thus allowing reuse of current NLG tools and TTS systems for other applications. The typical design is equivalent to "NLG plus Text-to-Speech (TTS)", where the common interface between NLG and TTS is plain text. One advantage of this design is its simplicity and adaptability: no change is necessary for existing NLG tools and TTS systems. However, it suffers from a serious problem in that it loses useful information. All discourse, semantic and syntactic information is lost when the internal representation of NLG is converted to the text output, and this information could clearly be useful in determining prosody.

In this paper, we want to maintain the autonomy of NLG and SS so that they are reusable for different applications, yet flexible enough to be integrated easily without losing useful information. We propose a new architecture in which the common interface is not plain text, but a Speech Integrating Markup Language (SIML). We show how this architecture can be used in a multimedia presentation application, for which a prototype SIML was designed.
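The contrast between the two interfaces can be illustrated with a minimal sketch. It is not the SIML defined in the paper: the element names (`utterance`, `phrase`, `word`), the attribute names (`cat`, `info`), and the `Word` and `Phrase` classes are hypothetical, chosen only to show that a plain-text interface discards the generator's annotations while a markup interface lets the synthesis side recover them.

```python
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET


@dataclass
class Word:
    text: str
    # Hypothetical annotations an NLG module might hold, e.g. {"info": "new"}.
    features: dict = field(default_factory=dict)


@dataclass
class Phrase:
    category: str  # hypothetical syntactic label, e.g. "np" or "vp"
    words: list


def to_plain_text(phrases):
    """Plain-text interface: only the word strings survive."""
    return " ".join(w.text for p in phrases for w in p.words)


def to_markup(phrases):
    """Markup interface: annotations are serialized alongside the words."""
    root = ET.Element("utterance")
    for p in phrases:
        node = ET.SubElement(root, "phrase", {"cat": p.category})
        for w in p.words:
            el = ET.SubElement(node, "word", dict(w.features))
            el.text = w.text
    return ET.tostring(root, encoding="unicode")


def from_markup(markup):
    """What a synthesis front end could read back: each word with its features."""
    root = ET.fromstring(markup)
    return [(w.text, dict(w.attrib)) for w in root.iter("word")]


# Illustrative generator output (invented for this sketch).
sample = [
    Phrase("np", [Word("the"), Word("system", {"info": "given"})]),
    Phrase("vp", [Word("speaks", {"info": "new"})]),
]

print(to_plain_text(sample))           # the system speaks
print(from_markup(to_markup(sample)))  # words paired with their recovered features
```

In this sketch the plain-text path gives the synthesizer no way to tell given from new information, whereas the markup path returns each word together with the features the generator attached, which a prosody module could then consult.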

Shimei Pan | Kathleen R. McKeown

[1] G. G. Stokes. "J.", 1890, The New Yale Book of Quotations.

[2] Mark Steedman, et al. Representing discourse information for spoken dialogue generation, 1996.

[3] James Paul Gee, et al. Performance structures: A psycholinguistic and linguistic appraisal, 1983, Cognitive Psychology.

[4] Scott Prevost, et al. A semantics of contrast and information structure for specifying intonation in spoken language generation, 1996.

[5] C. M. Sperberg-McQueen, et al. Guidelines for electronic text encoding and interchange, 1994.

[6] Michael Elhadad, et al. Using argumentation to control lexical choice: a functional unification implementation, 1993.

[7] F. Fallside, et al. Speech synthesis from concept: A method for speech output from information systems, 1979.

[8] Steven K. Feiner, et al. Negotiation for automated generation of temporal multimedia presentations, 1997, MULTIMEDIA '96.

[9] Shimei Pan, et al. Spoken language generation in a multimedia system, 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[10] Françoise Emerard, et al. Synthesis of Spoken Messages from Semantic Representations. Semantic-Representation-to-Speech System, 1986, COLING.

[11] Amy Isard, et al. SSML: A Markup Language for Speech Synthesis, 1995.

[12] Eileen Fitzpatrick, et al. A Computational Grammar of Discourse-Neutral Prosodic Phrasing in English, 1990, Comput. Linguistics.