Development of a visual speech synthesizer via second-order isomorphism

The goal of this study was to evaluate visible speech synthesis based on 3-D motion data, using second-order isomorphism. Word stimuli were generated for perceptual discrimination and identification tasks. Discrimination trials used word pairs predicted to fall at four levels of perceptual dissimilarity. Results from the discrimination tasks indicated that perception of the synthetic visual speech preserved the dissimilarity structure of natural visual speech perception. The study demonstrated that relatively sparse 3-D representations of face motion can be used to synthesize visual speech that perceptually approximates natural visual speech, suggesting that synthesizer development and psychophysics can be mutually beneficial when their goals are aligned.
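Second-order isomorphism, in Shepard's sense, means that two representational spaces agree in their relational structure: the pattern of dissimilarities among stimuli, not the individual stimuli themselves, is what must match. A minimal sketch of how such structural agreement can be quantified is below. The matrices, word count, noise model, and use of Spearman correlation are illustrative assumptions for exposition, not the study's actual data or analysis pipeline.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical perceptual dissimilarity matrices over the same word set:
# one estimated from natural-face trials, one from synthetic-face trials.
# Rows/columns index words; entries are pairwise perceptual distances.
rng = np.random.default_rng(0)
n_words = 8
base = rng.random((n_words, n_words))
natural = (base + base.T) / 2            # enforce symmetry
np.fill_diagonal(natural, 0.0)           # zero self-dissimilarity

# Simulate a synthetic-speech space that mostly preserves the structure.
synthetic = natural + rng.normal(0, 0.05, natural.shape)
synthetic = (synthetic + synthetic.T) / 2
np.fill_diagonal(synthetic, 0.0)

# Second-order isomorphism: correlate the two spaces' off-diagonal
# dissimilarities (upper triangle only, to avoid double-counting pairs).
iu = np.triu_indices(n_words, k=1)
rho, p = spearmanr(natural[iu], synthetic[iu])
print(f"second-order correlation: rho = {rho:.3f}, p = {p:.3g}")
```

A rank correlation is used here because perceptual dissimilarities are typically meaningful only ordinally; a high correlation indicates that the synthetic stimuli preserve the dissimilarity structure of the natural ones, which is the criterion the abstract describes.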
