We describe attempts to synthesize visible speech in real time on a Macintosh™ personal computer, and to enable the user to color the text to be synthesized with emotion, according to the user's wishes, the representation of the text, or the semantics of the utterance. The animated visible speech will be demonstrated, in real time, using a variety of on-screen agents and faces. The speech synthesizer used is Apple Computer's MacinTalkPro2®, running on a Quadra AV® computer.

Introduction

Researchers in disciplines as diverse as ethology, psychology, speech therapy, interpersonal communications and human-machine interface design agree that facial expressions enhance communication. Facial expressions convey both emotion (Ekman and Friesen, 1986) and other communicative signals (Chovil, 1991). A multimodal form of interaction with an on-screen computer agent is likely to improve the quality of the dialogue, both in terms of intelligibility and with regard to user-friendliness (Pelachaud et al., 1993). Our goal is to improve the human-machine dialogue by providing on-screen agents that both speak and display visual expressions. If the text to be spoken is predetermined and invariant, the agent may speak 'canned' phrases, recorded by a human speaker and stored as digitized movies. If, however, the agent is expected to speak unconstrained, novel utterances, then synthetic speech must be used and synchronized with facial movements. To this end, a system using disemes is described. Furthermore, adding vocal emotions to synthetic speech improves its naturalness and acceptability, and makes it more 'human'. We provide the user with the ability to generate and author vocal emotions in synthetic speech, using a limited number of prosodic parameters with the concatenative speech synthesizer. The vocal emotions are represented visually, and can be specified and manipulated directly using a graphical editor. Advantages, disadvantages and shortcomings of our work are discussed, together with future directions.

Disemes

Some previous visible speech systems have used 'visemes', i.e. minimal contrastive units of visible articulation of speech sounds. Visemes have typically ranged in number from 9 to 32 for American English. The reduction of phonetic and visually salient contrasts in work of that kind, together with naive assumptions about spelling-to-phoneme correspondences, is sufficient to indicate that animated speech using these visemes would be cartoon-like and unsuitable for representing human speech with any accuracy. Using only viseme-to-viseme transitions yields speech that is choppy, inaccurate, and insufficiently plastic. Other animations have concatenated naturally uttered segments, and the results are also choppy because the visual targets at the boundaries of the segments are highly dependent on phonetic context. Most of the work in the graphical domain has used linear transitions between 3-D key-frames, with often very poor results as far as visible speech intelligibility is concerned.

Concatenative speech synthesis systems, such as the Apple Macintosh MacinTalkPro2™ text-to-speech system, require a large inventory of diphone units from which to create utterances. Two improvements are therefore needed: expansion beyond simple visemes, and reduction of the number of diphones so that they can be mapped to facial images in real time on a personal computer. Synthetic speech that concatenates transitional units (e.g. diphones) has an advantage over rule-based (parametric) synthetic speech in that it retains the natural transitions, or co-articulations, from one phone to another, together with the natural voice quality of the speaker by whom the diphones were recorded. For the concatenative synthesis of General American English, a 50x50 matrix of phones, i.e. 2,500 potential diphones, might be needed. However, some sounds never co-occur (e.g. silence-NG), so approximately 1,800 diphones are used. Any real-time mapping of the entire diphone inventory to facial images would nevertheless impose severe speed and memory restrictions. The diphone inventory needs to be reduced to a set that shares visual articulatory features for the tongue, teeth, and lips only. To this end, visible and articulatory archiphones and diphone 'aliases' were formalized to create a system of disemes, as explained in Henton (1994) and below.

Using phonological theory (distinctive features), a set of archiphones that share visual articulatory features for the tongue, teeth, and lips was derived. For example, one archiphone has the lips spread and the teeth visible; the phone members of this set are {IY, IH, IX, IR, y}. Another archiphone comprises the set of phones that have the lips together: {b, p, PX, m}. For each archiphone, it is possible to choose an (abstract, arbitrary) archiphonic 'group' representative; for the two sets above these might be IY and b, respectively. Using the shared distinctive features for all 1,800 diphone transitional pairs, a set of archidiphones can be derived. Archidiphones are formed from the transition of one archiphone to another, e.g. IY-b. Using this method, ten visual archiphones were created. Potentially, there are 10x10 = 100 archidiphones; however, for animation purposes, it is unnecessary to store transitions from one archiphone to a member of the same archiphone group (e.g. b-b, or b-p). Further, the archiphone 'silence' is followed by the 9 other archiphones, because a transition to itself (silence-to-silence) is not needed. The other 9 archiphones are each followed by 9 archiphones, because a transition within a group is not needed; archiphone-silence transitions are needed to create a closed lip position for end-of-utterance silences. Consequently, the grand product is 10x9 = 90 visually distinctive diphones, or disemes. Disemes begin during one viseme (phone) in an archiphonic family and end somewhere during the following viseme (phone) in another archiphonic family. In this way, a very significant data reduction may be achieved: from 1,800 to 90, a reduction by a factor of 20.

The next step requires a phonetician to record the disemes onto videotape; see Figure 1. Alternatively, line drawings may be traced from the disemes, from which prototype diseme shapes for other talking agents can be produced: see Figure 2. By reproducing the full dynamics of a diseme, instead of synthesizing disemes from viseme-to-viseme linear transitions, more natural and fluid animation should be achievable. Disemes can then be recombined and synchronized with the diphone transitions produced by the speech synthesizer in novel, unlimited utterances.
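To make the archiphone grouping and the 10x9 arithmetic concrete, a minimal sketch of the diphone-to-diseme aliasing is given below, in Python. It is an illustration only, not the implementation used in our system: the lips-spread and lips-together groups are those described above, while the 'SIL' symbol and the remaining archiphone groups are placeholders.

    # Minimal sketch of the diphone -> diseme aliasing described above.
    # Only the {IY, IH, IX, IR, y} and {b, p, PX, m} groups come from the text;
    # the 'SIL' symbol and the other archiphone groups are hypothetical placeholders.

    ARCHIPHONES = {
        "IY":  {"IY", "IH", "IX", "IR", "y"},   # lips spread, teeth visible
        "b":   {"b", "p", "PX", "m"},           # lips together
        "SIL": {"SIL"},                         # silence: closed, neutral lips
        # ... seven further groups would complete the ten visual archiphones
    }

    # Invert the table: phone -> archiphonic group representative.
    PHONE_TO_ARCH = {phone: rep
                     for rep, members in ARCHIPHONES.items()
                     for phone in members}

    def diseme_alias(phone_a, phone_b):
        """Map a diphone phone_a-phone_b to its diseme alias, or return None
        when both phones fall in the same archiphone group and no visual
        transition needs to be stored."""
        arch_a, arch_b = PHONE_TO_ARCH[phone_a], PHONE_TO_ARCH[phone_b]
        return None if arch_a == arch_b else f"{arch_a}-{arch_b}"

    print(diseme_alias("IH", "p"))   # "IY-b": spread lips closing
    print(diseme_alias("b", "m"))    # None: same lips-together group
    print(10 * (10 - 1))             # 90 stored disemes for ten archiphones

Under a scheme of this kind, each of the roughly 1,800 diphones collapses onto one of the 90 stored disemes, or onto none at all when both phones already share a visual group and the facial target does not change.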
Emotions

Perception of vocal emotion is culture-specific, and varies according to individual sensitivity, expectations and experience (Murray and Arnott, 1993). The literature on vocal emotions indicates that consensus can generally be reached on the perception of 'baseline' emotions, along the scales aggressiveness-pleasantness and interest-disgust. The acoustic parameters available for the creation of emotions in the speech synthesizer used here are: overall pitch mean and range; volume, duration and speaking rate; and the pitch movements and durations of individual segments. Within the limited scope of this paper, it is impossible to detail the values of the acoustic parameters used to create emotions in this concatenative synthesizer. A graphical editor is used to specify and manipulate the speech directly. The emotions that have been implemented are anger, happiness, curiosity, sadness, and boredom; see Figure 3. Details of these enhancements are given in Henton and Edelman (forthcoming).
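Although the parameter values themselves are beyond the scope of this paper, the form of the specification can be sketched. The Python fragment below is illustrative only and is not the MacinTalkPro2 interface: the field names, baseline values and per-emotion settings are hypothetical, chosen simply to show how a small set of global settings (pitch mean and range, volume, speaking rate) plus per-segment pitch and duration adjustments could encode an emotion; the values actually used are those reported in Henton and Edelman (forthcoming).

    # Illustrative sketch of an emotion as a bundle of prosodic settings.
    # All names and numbers are hypothetical, not the MacinTalkPro2 values.

    from dataclasses import dataclass, field

    @dataclass
    class EmotionSettings:
        pitch_mean_st: float = 0.0      # shift of overall pitch mean, in semitones
        pitch_range_pct: float = 100.0  # pitch range as % of the neutral range
        volume_pct: float = 100.0       # overall volume as % of neutral
        rate_pct: float = 100.0         # speaking rate as % of neutral
        # optional per-segment tweaks: segment index -> (pitch offset st, duration %)
        segment_tweaks: dict = field(default_factory=dict)

    # A hypothetical palette covering the five emotions implemented here.
    EMOTIONS = {
        "neutral":   EmotionSettings(),
        "anger":     EmotionSettings(pitch_mean_st=+2.0, pitch_range_pct=140,
                                     volume_pct=120, rate_pct=115),
        "happiness": EmotionSettings(pitch_mean_st=+3.0, pitch_range_pct=150,
                                     rate_pct=105),
        "curiosity": EmotionSettings(pitch_range_pct=130,
                                     segment_tweaks={-1: (+4.0, 120)}),  # final rise
        "sadness":   EmotionSettings(pitch_mean_st=-2.0, pitch_range_pct=70,
                                     volume_pct=85, rate_pct=85),
        "boredom":   EmotionSettings(pitch_mean_st=-1.0, pitch_range_pct=60,
                                     rate_pct=80),
    }

    def apply_emotion(baseline_f0_hz, baseline_rate_wpm, emotion):
        """Return (pitch mean in Hz, speaking rate in wpm) under an emotion,
        treating a semitone shift as multiplication by 2 ** (st / 12)."""
        e = EMOTIONS[emotion]
        f0 = baseline_f0_hz * 2 ** (e.pitch_mean_st / 12.0)
        rate = baseline_rate_wpm * e.rate_pct / 100.0
        return f0, rate

    print(apply_emotion(110.0, 180, "sadness"))  # lowered pitch, slower rate

These global and per-segment quantities are the ones the user manipulates directly in the graphical editor; the sketch merely shows how they might be bundled per emotion.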
Graphics techniques

Graphical techniques are needed to eliminate the very noticeable jumps that concatenating live-video disemes produces. By animating diseme sequences, a variety of characters can be produced and alignment between disemes can be assured. By using the cross-mapping technique demonstrated in Patterson et al. (1991) in conjunction with the image-warping ('morphing') algorithm presented in Litwinowicz and Williams (1994), one set of tracked disemes may be used to drive the animated disemes for multiple characters. Currently, however, time limitations have restricted our implementation to linear transitions of visemes derived from an actor. QuickTime™ playback of pre-stored animated disemes makes it possible to synchronize the disemes with synthetic speech.

Conclusion and future work

One of the inherent limitations of using disemes stems directly from diphones. Phonetic research has shown that anticipatory co-articulation for the lips extends over several segments, and that its acoustic consequences go far beyond one or two phonetic segments ahead (see e.g. Benguerel and Cowan, 1974). The positions of the lips, teeth, and tongue may thus be altered visibly several hundred milliseconds both before and after the segment currently being spoken by the speech synthesizer. Only when hybrid synthesis produced by a parametric-concatenative system is available can this type of anticipatory modeling be achieved. Similarly, the synthesis system described here is unable to hypo- or hyper-articulate according to the wishes of the speaker, except by prosodic means, and this may sometimes result in unnatural trajectories. Nevertheless, we consider that we have made a convincing initial attempt to overcome the difficulties presented by bimodal text-to-speech synthesis, and that the 2-D on-screen agents are a significant, real-time, computationally low-cost enhancement to human-computer communication. Further work is required to produce more sophisticated speech synthesis, with a greater variety of emotions, and to quantify the gain both in speech intelligibility (e.g. according to the protocol first proposed by Sumby and Pollack, 1954) and in the overall user experience. Using the full dynamics of tracked disemes should produce better diseme animations. Ideally, the goal is to synchronize a hybrid speech synthesizer with even
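To make the proposed intelligibility evaluation concrete, the short sketch below computes the relative visual contribution commonly derived from Sumby-and-Pollack-style audio versus audio-visual word tests: the improvement gained by adding the face, normalized by the headroom left above the audio-only score. The recognition scores in the example are hypothetical.

    # Relative visual contribution to intelligibility, R = (AV - A) / (1 - A),
    # often used with Sumby-and-Pollack-style tests. Scores are hypothetical
    # proportions of words correctly reported.

    def visual_contribution(audio_only, audio_visual):
        """Gain from the face, normalized by the headroom above the
        audio-only score (both arguments are proportions in [0, 1))."""
        if not 0.0 <= audio_only < 1.0:
            raise ValueError("audio-only score must be in [0, 1)")
        return (audio_visual - audio_only) / (1.0 - audio_only)

    # Hypothetical example: 40% correct on audio alone, 70% with the animated face.
    print(visual_contribution(0.40, 0.70))  # -> 0.5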
References

[1] Caroline G. Henton, et al. Beyond visemes: Using disemes in synthetic speech with facial animation. 1994.
[2] Lance Williams, et al. Animating images with drawings. SIGGRAPH, 1994.
[3] Catherine Pelachaud, et al. Rule-Structured Facial Animation System. IJCAI, 1993.
[4] Peter C. Litwinowicz, et al. Facial Animation by Spatial Mapping. 1991.
[5] Akikazu Takeuchi, et al. Speech Dialogue With Facial Displays: Multimodal Human-Computer Conversation. ACL, 1994.
[6] Nicole Chovil. Social determinants of facial displays. 1991.
[7] A. Benguerel, et al. Coarticulation of Upper Lip Protrusion in French. Phonetica, 1974.
[8] P. Ekman, et al. A new pan-cultural facial expression of emotion. 1986.
[9] Iain R. Murray, et al. Toward the simulation of emotion in synthetic speech: a review of the literature on human vocal emotion. The Journal of the Acoustical Society of America, 1993.
[10] W. H. Sumby, et al. Visual contribution to speech intelligibility in noise. 1954.