Modeling visual coarticulation in synthetic talking heads using a lip motion unit inventory with concatenative synthesis

The shape and synchronization of lip movement with speech is one of the key factors in the acceptability of a synthetic persona, particularly as synthetic characters approach photo-realism. Most of us cannot lipread, nor easily identify a sound from lip shape alone, yet we readily detect whether the lip movements of a synthetic talking head are acceptable, even when the viewer/listener is a considerable distance from the speaker. In addition, experiments have shown that visible synthetic speech augments audible synthetic speech in terms of intelligibility and recognition accuracy, particularly in noisy conditions where the audio signal is degraded [1]. Synthesizing the right lip movements for talking heads is therefore an important task in achieving a high degree of naturalness, as well as for applications that assist hearing-impaired individuals.

One of the major challenges, in speech synthesis as well as in lip-motion synthesis, is the modeling of coarticulation. Coarticulation is the influence exerted on the articulation of a speech segment by the preceding segments (backward, or retentive, coarticulation) and by the following segments (forward, or anticipatory, coarticulation). Coarticulation has been shown to affect speech sounds up to six segments away [2]. Various techniques have been used to model visual coarticulation, all of which make assumptions about the degree of forward and backward influence and the way in which it is modeled, ranging from simple additive influences to complex mathematical models. These models are usually physiologically grounded; for example, the speed at which the muscles shaping the mouth can react may be one important factor. However, rule-based models are by their very nature complex, since the physiology of the visible articulation musculature is itself complex.

Rather than explicitly modeling this facial physiology, we present a data-driven method in which the dynamics of the facial musculature are captured in synchronization with the acoustic data. This approach improves on other data-driven techniques [10,11] in that it allows us to model visual coarticulatory effects as an extension of a concatenative speech synthesis unit selection process. Concatenative synthesis relies on the ability to extract contextually appropriate N-phone units of speech (thereby capturing coarticulatory effects), which are then concatenated and deformed according to linguistic criteria, for example when stress or changes in pitch and duration are required for intonation. Our hypothesis is that these linguistic criteria apply to visual lip synthesis in a similar way. This paper investigates how the visual unit selection process is realized.

1. UNIT INVENTORIES IN SYNTHESIS

In creating the speech unit database, every effort is usually made to ensure that all units exist at the diphone or triphone level. However, there may be occasions when the right units are not available for concatenation. Also, and perhaps more importantly, the linguistic environment has a strong impact on unit selection: because the units are extracted from real speech, they carry a set of intrinsic characteristics that include linguistic and paralinguistic influences. When a speech database is recorded, care must be taken to ensure that the recording process removes or minimizes these influences, with the possibility that they can be regenerated as required.
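To make the selection machinery described above concrete, the following minimal sketch, which is not taken from the paper and whose unit structure, lip parameters and cost weights are all hypothetical, shows one way an audio-visual unit inventory could be queried with triphone-to-diphone back-off, combining a linguistic target cost (here, just stress match) with a visual join cost over lip parameters at unit boundaries:

# Illustrative sketch only: the inventory layout, features and weights
# below are assumptions for exposition, not the system described here.
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class Unit:
    phones: Tuple[str, ...]        # phone context covered by the unit
    lip_start: Tuple[float, ...]   # lip parameters at the unit's left boundary
    lip_end: Tuple[float, ...]     # lip parameters at the unit's right boundary
    stressed: bool                 # one example of an intrinsic linguistic attribute

def candidates(inventory: Dict[Tuple[str, ...], List[Unit]],
               triphone: Tuple[str, str, str]) -> List[Unit]:
    """Return triphone candidates, backing off to the initial diphone
    when the triphone is missing from the inventory."""
    if triphone in inventory:
        return inventory[triphone]
    return inventory.get(triphone[:2], [])   # diphone back-off

def target_cost(unit: Unit, want_stress: bool) -> float:
    # Penalize a mismatch in linguistic context (stress only, for brevity).
    return 0.0 if unit.stressed == want_stress else 1.0

def join_cost(prev: Unit, nxt: Unit) -> float:
    # Visual discontinuity: squared distance between lip parameters at the join.
    return sum((a - b) ** 2 for a, b in zip(prev.lip_end, nxt.lip_start))

def select(inventory: Dict[Tuple[str, ...], List[Unit]],
           triphones: List[Tuple[str, str, str]],
           stress_pattern: List[bool],
           w_target: float = 1.0, w_join: float = 1.0) -> List[Unit]:
    """Greedy left-to-right selection over the candidate units."""
    chosen: List[Unit] = []
    for tri, want_stress in zip(triphones, stress_pattern):
        cands = candidates(inventory, tri)
        if not cands:
            continue                          # no unit available at all
        def cost(u: Unit) -> float:
            c = w_target * target_cost(u, want_stress)
            if chosen:
                c += w_join * join_cost(chosen[-1], u)
            return c
        chosen.append(min(cands, key=cost))
    return chosen

A full system would replace the greedy left-to-right choice with a search over the whole candidate lattice and would weigh acoustic join costs alongside the visual ones; the sketch only illustrates how missing units and visual coarticulatory continuity can be handled within a single cost framework.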