Analysis and synthesis of the 3D movements of the head, face and hand of a speaker using cued speech

This paper presents our efforts to characterize the 3D movements of the right hand and the face of a French female speaker during the production of cued speech. The 3D trajectories of 50 hand and 63 facial fleshpoints during the production of 238 utterances are analyzed. These utterances were carefully designed to cover all possible diphones of French. Linear and nonlinear statistical models of the deformations and postures of the hand and the face have been developed using both separate and joint corpora. Recognition of hand and face postures at targets is performed to verify a posteriori that the key hand movements and postures imposed by cued speech were well realized by the subject. The recognition results are further exploited to study the phonetic structure of cued speech, notably the phasing relations between hand gestures and sound production. Finally, a first implementation of a concatenative audiovisual text-to-cued-speech synthesis system is described that exploits this unique and extensive data on cued speech in action.
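Linear statistical models of fleshpoint data of this kind are commonly obtained by principal component analysis of the measured 3D coordinates, yielding a mean posture plus a small set of deformation modes that can be used both for analysis and for driving synthesis. The sketch below illustrates that general idea only; it is not the authors' implementation, and the array shapes, function names, and toy data are illustrative assumptions.

```python
# Minimal sketch (assumed, not the paper's code): a linear model of 3D fleshpoint
# configurations via principal component analysis.
import numpy as np

def fit_linear_model(frames: np.ndarray, n_components: int = 6):
    """frames: (n_frames, n_points * 3) matrix of flattened 3D fleshpoint coordinates."""
    mean_shape = frames.mean(axis=0)
    centered = frames - mean_shape
    # SVD of the centered data gives the principal deformation modes.
    _, singular_values, components = np.linalg.svd(centered, full_matrices=False)
    basis = components[:n_components]                    # (n_components, n_points * 3)
    variance = (singular_values[:n_components] ** 2) / len(frames)
    return mean_shape, basis, variance

def reconstruct(mean_shape, basis, weights):
    """Synthesize a posture from low-dimensional control parameters."""
    return mean_shape + weights @ basis

# Toy usage with random data standing in for motion-capture frames
# (e.g. 63 facial fleshpoints, each with x, y, z coordinates):
rng = np.random.default_rng(0)
fake_frames = rng.normal(size=(500, 63 * 3))
mean_shape, basis, variance = fit_linear_model(fake_frames)
posture = reconstruct(mean_shape, basis, rng.normal(size=basis.shape[0]))
```

In such a model, the retained components act as articulatory-style control parameters: recognition of hand or face postures can operate in the low-dimensional weight space, and synthesis amounts to reconstructing fleshpoint positions from predicted weights.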
