Towards a low bandwidth talking face using appearance models

Abstract: This paper is motivated by the need for low-bandwidth virtual humans capable of delivering audio-visual speech and sign language at a quality comparable to high-bandwidth video. Using an appearance model combined with parameter compression significantly reduces the number of bits required to animate the face of a virtual human. A perceptual method is used to evaluate the quality of the synthesised sequences, and the results indicate that 3.6 kbit s⁻¹ can yield acceptable quality.
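To make the bit-rate claim concrete, the sketch below illustrates (not the authors' implementation) how a face frame can be encoded as a small vector of appearance-model (PCA) parameters and then uniformly quantised, so each frame costs only a few hundred bits. The model size, bit budget, and parameter range are all assumptions chosen for illustration.

```python
# Illustrative sketch of appearance-parameter coding; all dimensions,
# bit budgets, and ranges below are assumed, not taken from the paper.
import numpy as np

rng = np.random.default_rng(0)

# Assume a pre-trained appearance model: a mean vector and an orthonormal
# PCA basis over the concatenated shape+texture data (faked here).
n_pixels, n_modes = 5000, 30                      # assumed model size
mean = rng.normal(size=n_pixels)
basis = np.linalg.qr(rng.normal(size=(n_pixels, n_modes)))[0]

def encode(frame, bits_per_param=8, param_range=3.0):
    """Project a frame onto the model and uniformly quantise the parameters."""
    params = basis.T @ (frame - mean)             # appearance parameters
    levels = 2 ** bits_per_param
    q = np.round((params / param_range + 1) / 2 * (levels - 1))
    return np.clip(q, 0, levels - 1).astype(np.uint16)

def decode(q, bits_per_param=8, param_range=3.0):
    """Dequantise the parameters and reconstruct an approximate frame."""
    levels = 2 ** bits_per_param
    params = (q / (levels - 1) * 2 - 1) * param_range
    return mean + basis @ params

# Toy frame drawn from the model itself:
frame = mean + basis @ rng.normal(size=n_modes)
q = encode(frame)
recon = decode(q)

bits_per_frame = q.size * 8
print(f"{bits_per_frame} bits/frame -> "
      f"{bits_per_frame * 25 / 1000:.1f} kbit/s at 25 fps")
print("reconstruction RMS error:", np.sqrt(np.mean((frame - recon) ** 2)))
```

With 30 parameters at 8 bits each and 25 frames per second this naive scheme already sits at about 6 kbit s⁻¹; fewer bits per parameter, temporal prediction, or entropy coding would bring it towards the rates reported in the abstract.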
