In general purpose concatenated waveform synthesis an exhaustive inventory of stored waveforms is needed. Our SPRUCE system is syllable and word based, but for general purpose work its inventory needs examples of all possible syllables. The high-level synthesis engine used to generate the phonology and prosody of utterances is already general purpose – but its use is constrained by small low-level inventories of re-combinable waveforms. The feasibility study reported here was carried out to determine whether we could take one of the word based limited domain versions of the system and make it more general by excising syllables from existing polysyllabic words and recombining them into new words. Initially the study treats temporal rather than spectral considerations.

1. PRELIMINARIES

Concatenated waveform synthesis [1] uses an inventory of stored waveforms. This paper reports experiments in enlarging MeteoSPRUCE – a weather forecasting application of our general purpose high-level TTS engine SPRUCE [2] – to widen its usability without the need for re-recording [3] [4] [5] [6] [7].

Before embarking on the task of excising and recombining we needed to be clear on a number of basic theoretical points:

• Phonological symbolic representations [8] are of limited use for identifying syllables in the waveform. The phonological concept of a boundary carries over uneasily to the waveform.
• Phonetic representations [9] are also symbolic, and although we can identify an allophone string corresponding to a phonological syllable there is still often no clear feature for acoustically delimiting syllables.
• The notion of a boundary as a point for cutting a waveform is misleading. Acoustic syllables often overlap, telescope or merge, and one syllable may ‘begin’ before the previous one has ‘ended’; that is, the time allocated to a sequenced pair of syllables is not always the sum of the individual times.
• Coarticulation [10] or coproduction [10] [11], which is responsible for temporal overlap, is also responsible for spectral overlap. Even if cuts are made at the ‘right’ places there is a problem of including spectral boundary effects from both syllables when they are recombined in new but ‘wrong’ contexts.

2. A SIMPLE EXAMPLE

The 2000-word MeteoSPRUCE database includes waveforms of the words unsettled and likely: let’s try using these to create a new word unlikely – i.e. to detach the syllable un and place it in front of the like syllable of likely. Phonetic syllable boundaries are marked in the database morphemically if possible, otherwise phonologically. Fig.1 shows the database entries.

Fig.1 unsettled and likely in the MeteoSPRUCE database.

By cutting unsettled at the end of the last pitch period of un we can paste the beginning of the file to the start of likely to produce a new reconstructed word object *unlikely. Fig.2 compares the result of conjoining the syllables with a recording of unlikely, which on this occasion is in the database.

Fig.2 Reconstruction of *unlikely, and the recorded waveform of unlikely in MeteoSPRUCE.

The degree of coproduction between syllables is context dependent – we deliberately picked the syllable un in unsettled because it showed the minimum of ‘telescoping’ coproduction.

Fig.3 Reconstruction of *unlikely using the derived synthetic syllable un and the recorded word likely (also normalised at the beginning of the word to form the synthetic syllable like).

page 2303 ICPhS99 San Francisco

So far, we have identified three stages in the reconstruction procedure: a. phonetic syllable excision, b. normalisation, c. synthetic syllable conjoining.

There are errors in the reconstruction, and the transition between the syllables un and like appears protracted and awkwardly joined. An improvement (Fig.3) is obtained by a normalising procedure dealing with syllable overlap.
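The excision and conjoining stages can be illustrated with a minimal sketch. It assumes waveforms held as plain lists of samples and cut points given as sample indices. The function names excise and conjoin, the toy data and the linear cross-fade are illustrative assumptions and not part of SPRUCE; the cross-fade merely stands in for a treatment of syllable overlap and is not the normalisation procedure itself.

```python
from typing import List, Sequence


def excise(waveform: Sequence[float], start: int, end: int) -> List[float]:
    """Return samples [start, end) of a stored word as a candidate phonetic syllable."""
    if not 0 <= start <= end <= len(waveform):
        raise ValueError("cut points must lie inside the waveform")
    return list(waveform[start:end])


def conjoin(first: Sequence[float], second: Sequence[float],
            overlap: int = 0) -> List[float]:
    """Join two syllables so that the last `overlap` samples of `first`
    coincide in time with the first `overlap` samples of `second`.

    overlap == 0 is a plain paste. For overlap > 0 a linear cross-fade is
    applied; this is an illustrative assumption, not the paper's method.
    """
    if overlap < 0 or overlap > min(len(first), len(second)):
        raise ValueError("overlap must fit inside both syllables")
    split = len(first) - overlap
    head = list(first[:split])
    tail = list(second[overlap:])
    mixed = []
    for i in range(overlap):
        # Weight of the second syllable rises across the overlapped region.
        w = (i + 1) / (overlap + 1)
        mixed.append((1.0 - w) * first[split + i] + w * second[i])
    return head + mixed + tail


# Toy sample values standing in for stored words (not real MeteoSPRUCE data).
unsettled = [1.0] * 10
likely = [0.0] * 8

un = excise(unsettled, 0, 4)           # hypothetical cut at sample 4
pasted = conjoin(un, likely)           # plain paste: durations add
telescoped = conjoin(un, likely, 2)    # overlapped: shorter than the sum

print(len(pasted), len(telescoped))    # 12 10
```

With no overlap the durations simply add; with a two-sample overlap the result is two samples shorter, which mirrors the point made in Section 1 that the time allocated to a sequenced pair of syllables is not always the sum of the individual times.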
The normalising procedure involves setting up a synthetic syllable, derived in the normalisation process from the phonetic syllable.

3. IDENTIFYING AND DESCRIBING SYLLABLES

To clarify the concept of recovery: it may be possible to excise a stretch of waveform of the right length from a suitable word, but because of coproduction effects it is unlikely to be directly reusable except in a similar context. Recovery means excision and reconstruction. The excised stretch of waveform – the phonetic syllable – is used as the basis for reconstructing the desired waveform – the synthetic syllable. The procedure we have developed for syllable recovery calls for syllable models defined on three different levels.

Phonological syllable – a unit higher than the ‘sound’ segment [12]. Introduced to form a framework for characterising the sequencing of simple segments, it provides the primary unit for modelling prosody. Phonetic detail is irrelevant at this level; what matters is nonlinear organisation into syllabic units. We characterise phonological syllables as in linguistics [13]. In our model the phonological syllable figures prominently because it enables direct reference to a listener’s perception of ‘sound’ sequencing – the phonological syllable characterises for us the result of successful perception. Since our synthesis philosophy revolves around satisfying a listener’s perceptual abilities we need a level specifically designed to capture this. So, listeners identify a unit at the beginning of unsettled, pronounce it in isolation and tell us that it is the same as a unit identified at the start of the word unlikely. This cognitive similarity is not the same as acoustic similarity – coarticulatory phenomena constrain the two uns to be systematically different acoustically. The goal of the reconstruction procedure is to use a portion of the waveform of unsettled to change likely into a correctly perceived new word unlikely.
Phonetic syllable – a descriptive unit characterising the part of a human acoustic signal that prompts a listener to identify a phonological syllable. This is where distinguishing acoustic features are identified, as well as other acoustic features. The model describes the waveform as in acoustic phonetics [14]. Which ‘sounds’ are sequenced in a phonetic syllable is a phonological rather than a phonetic matter in our reconstruction procedure. The phonetic syllable is the waveform which triggers the phonological syllable – and its phonetic description. There has been much discussion concerning the relationship between phonetic and phonological characterisations of the same stretch of speech [15]. The phonetic syllable models the acoustic signal and the phonological syllable models a cognitive response to the signal. The models are linked since they each deal with the same signal. Notice that we are using the term phonetic syllable to refer both to a stretch of waveform and to its acoustic model.

Synthetic syllable – a model of an acoustic stretch which can be manipulated to trigger in the listener a response of the right phonological syllable. The synthetic syllable may or may not be the same as the phonetic syllable from which it is derived. In SPRUCE a waveform in the database can be a phonetic syllable (modelling the human syllable, e.g. snow), but it is also there as a synthetic syllable – a model for concatenation to produce a new word, e.g. snowing. The synthetic syllable derives from a phonetic entry in the database by a normalisation procedure which varies in complexity depending on syllable type and the environment from which it is to be excised – that is, the normalisation process is both context and type sensitive.

4. SYLLABLE TYPES AND CONTEXTS

We classify syllable types by their phonological start (onset) and end (coda). Initially we were concerned about coarticulatory effects between phonetic syllables, i.e.
that reconstructed words should have the correct temporal and spectral phonetic properties at new syllable boundaries. However, to take full account of all the acoustic effects of quality change resulting from coproduction, all combinations would need to be considered. For this initial study we reduced the problem to a working model of temporal syllable combining. Setting aside phonetic quality at syllable boundaries, we focused on the temporal properties of onset and offset. Examination of all words in the database revealed that our working model might need to deal only with initial and final segment types, rather than with every individual segment that occurs. We established segment types according to the usual phonetic parameters [4]. So, all syllables include a vowel segment preceded by up to three phonetic consonants and followed by up to four:
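As a hedged illustration of the constraint just stated, the sketch below checks a syllable written as a string of C (consonant) and V (vowel) symbols against a template with at most three consonants before the vowel and at most four after it. The CV-string encoding, the treatment of the vowel as a single V symbol and the name onset_coda are assumptions made for this example only.

```python
import re
from typing import Optional, Tuple

# Up to three consonants, one vowel segment, up to four consonants.
# The bounds 3 and 4 come from the prose; the CV encoding is an assumption.
SYLLABLE_TEMPLATE = re.compile(r"(C{0,3})V(C{0,4})")


def onset_coda(cv_pattern: str) -> Optional[Tuple[int, int]]:
    """Return (onset length, coda length) for a CV string,
    or None if the string does not fit the template."""
    match = SYLLABLE_TEMPLATE.fullmatch(cv_pattern)
    if match is None:
        return None
    return len(match.group(1)), len(match.group(2))


print(onset_coda("VC"))        # e.g. un: no onset, one coda consonant
print(onset_coda("CCCVCCCC"))  # the largest syllable the template allows
print(onset_coda("CCCCV"))     # four onset consonants: rejected
```

Classifying syllables only by the lengths of onset and coda is the coarsest possible typing; the segment types mentioned above would refine it by recording what kind of consonant occupies the initial and final positions.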

REFERENCES

[1] Christel Sorin et al. Levels in speech communication: relations and interactions: a tribute to Max Wajskop = Hommage à Max Wajskop. 1995.
[2] Shinya Nakajima. Automatic synthesis unit generation for English speech synthesis based on multi-layered context oriented clustering. Speech Commun., 1994.
[3] Douglas D. O'Shaughnessy et al. Speech communication: human and machine. 1987.
[4] Olivier Boëffard et al. Automatic generation of optimized unit dictionaries for text to speech synthesis. ICSLP, 1992.
[5] Mark Tatham et al. Supervision of speech production. 1995.
[6] A. C. Gimson et al. An introduction to the pronunciation of English. 1991.
[7] Jan P. van Hemert et al. Automatic segmentation of speech. IEEE Trans. Signal Process., 1991.
[8] Alan W. Black et al. Unit selection in a concatenative speech synthesis system using a large speech database. 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, 1996.
[9] Haruo Kubozono et al. Autosegmental and metrical phonology. By JOHN A GOLDSMITH. Oxford: Basil Blackwell, 1990. vii, 376. 1991.
[10] Alan W. Black et al. Prosody and the Selection of Source Units for Concatenative Synthesis. 1997.