Output requirements for a high-quality speech synthesis system: the case of disambiguation.

Sometimes, speakers attempt to disambiguate multiple interpretations by means of modifications of prosodic parameters, while at other times, they use various types of circumlocutions for this purpose. In view of a potential implementation of prosodic disambiguation in high-quality speech synthesis systems, the following questions arise: (1) Which types of ambiguity are open to prosodic disambiguation? (2) How are prosodic parameters modified as a function of the disambiguation attempt? (3) Can a high-quality speech synthesis system mimic prosodic disambiguation effects? We report on a small pilot project performed to explore these issues.

Introduction

LAIPTTS is a high-quality text-to-speech system for French, developed at the University of Lausanne. Its creation was guided by two main objectives:

• “High Quality”: Synthetic speech should resemble natural speech as much as possible, so as to permit the greatest possible ease in the comprehension of the spoken message (see Sanderman, 1996). In our system, this has been accomplished by paying exceptionally close attention to the phonetic details of speech generation, especially those related to prosody.

• “Compactness”: The synthesis mechanism should require as few computational resources as possible. This has been accomplished by favoring algorithmic solutions over database solutions, and by implementing only a proximal grammar, i.e., rendering grammatical relations sensitive primarily to immediately surrounding lexical elements, and maximally, to those lying in the range of the prosodic group.

The current status of the system (autumn 1996) is as follows: the lexicogrammatical, phonological and prosodic modules are by and large completed, and the diphone output module is scheduled to be completed in the summer of 1997. Currently, we use the Mons (Belgium) MBROLA diphone output system to produce audible output¹. The system consists of a 350k application (without interface) and uses a 2 Mb dictionary as well as a 4.8 Mb diphone database.

Like most text-to-speech systems, LAIPTTS is structured into four main modules (Figure 1). The first module takes written text and generates an annotated phonetic chain for each sentence on the basis of a dictionary and graphemo-phonetic rules. The chain is parsed into prosodic groups, and various phonological rules (liaison, chaining, schwa-handling, syllabification) are applied. In the next module, durations and F0 values are generated. These values are combined with diphone segments in the subsequent module, after which the entire signal is reproduced. Calculations for the second and following sentences in a text are performed in parallel with the sound output of the preceding sentence. This provides real-time synthesis performance on unrestricted text on most high-entry personal computers (PowerPC, Pentium-level).

¹ The MBROLA diphone output system is produced by Thierry Dutoit of Mons, Belgium. For details on MBROLA, please see http://tcts.fpms.ac.be/synthesis.

Keller & Zellner, Output Requirements for Disambiguation

Key Output Requirements

Implementation details of a synthesis system can be summarized in terms of its output requirements. To place the implementation of prosodic disambiguation strategies into its appropriate context, we shall briefly review the three key output requirements related to prosody, as well as the implementations we have chosen to meet these requirements.

Prosodic grouping: The system must perform word grouping in the way human speakers do. The placement of group marks, as well as the implementation of pauses and group-final lengthening, depends on this process. In order to meet this requirement, we have developed a psycholinguistic algorithm oriented towards timing which is largely inspired by, and builds upon, the large body of research on psycholinguistic indicators of prosodic grouping by Grosjean and colleagues (Gee & Grosjean, 1983; Monnin & Grosjean, 1993; Keller et al. 1993; Zellner, 1996) (Figure 2). For a first cut, the distinction of two levels (major groups, minor groups) appears sufficient for the purpose of predicting timing. For reasons detailed in Zellner (1996), psycholinguistic grouping algorithms appear to provide better predictions for overall prosodic grouping in French than do syntactically-based algorithms.

Syntactic, lexical and phonological processing
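
To make the two-level distinction under "Prosodic grouping" concrete, the following Python sketch shows how major and minor groups might drive group-final lengthening and pauses. It is an illustration only and not the LAIPTTS algorithm: the grouping itself is supplied by hand rather than computed, and the class names, the function `timing_marks`, the lengthening factors and the pause duration are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class MinorGroup:
    """A minor prosodic group: a short run of words."""
    words: List[str]


@dataclass
class MajorGroup:
    """A major prosodic group: one or more minor groups."""
    minor_groups: List[MinorGroup] = field(default_factory=list)


# Hypothetical values for illustration; they are not taken from the paper.
MINOR_FINAL_LENGTHENING = 1.2   # factor on the last word of a minor group
MAJOR_FINAL_LENGTHENING = 1.5   # factor on the last word of a major group
MAJOR_PAUSE_MS = 200            # pause inserted after a major group


def timing_marks(utterance: List[MajorGroup]) -> List[Tuple[str, float, int]]:
    """Return (word, lengthening_factor, pause_ms_after) for every word."""
    marks = []
    for major in utterance:
        last_minor_index = len(major.minor_groups) - 1
        for mi, minor in enumerate(major.minor_groups):
            last_word_index = len(minor.words) - 1
            for wi, word in enumerate(minor.words):
                group_final = wi == last_word_index
                if group_final and mi == last_minor_index:
                    # End of a major group: strongest lengthening plus a pause.
                    marks.append((word, MAJOR_FINAL_LENGTHENING, MAJOR_PAUSE_MS))
                elif group_final:
                    # End of a minor group only: lengthening, no pause.
                    marks.append((word, MINOR_FINAL_LENGTHENING, 0))
                else:
                    marks.append((word, 1.0, 0))
    return marks
```

For a hand-grouped utterance such as `[MajorGroup([MinorGroup(["le", "petit", "chat"]), MinorGroup(["dort"])])]`, the word "chat" receives the minor-group factor and "dort" receives the major-group factor followed by a pause. Keeping only two levels keeps the timing rule to a single three-way decision per word.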