A System Theoretic Approach to Synthesis and Classification of Lip Articulation

We present a system for synthesizing lip movements and for recognizing speakers and phrases from visual lip sequences. Low-dimensional geometric lip features, such as trajectories of landmarks on the outer lip contour and vertical distances measuring the mouth opening, are first extracted from the images. The temporal evolution of these features is modeled with linear dynamical systems, whose parameters are learned using system identification techniques. By exploiting physical constraints of lip movement in both the learning and synthesis stages, realistic synthesis of novel sequences is achieved. Recognition is performed using classification methods, such as nearest neighbors and support vector machines, combined with metrics based on subspace angles and with kernels such as the Binet-Cauchy, Martin, and Kullback-Leibler kernels. Experiments are designed to find the combination of features, identification method, kernel, and classification method that is most appropriate for synthesis and classification of lip articulation.
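The pipeline summarized above (fit a linear dynamical system to each feature trajectory, then compare the fitted systems with a subspace-angle metric) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and it rests on several assumptions: the identification step is a simple SVD-based fit standing in for the system identification methods the abstract mentions, the Martin distance is approximated from finite observability matrices of `m` blocks rather than the infinite ones in its definition, and the names `fit_lds`, `observability`, `martin_distance_sq`, and `nearest_neighbor` are hypothetical.

```python
import numpy as np

def fit_lds(Y, n):
    """Fit x[t+1] = A x[t], y[t] = C x[t] to a (p, T) feature trajectory.

    Assumption: a simple SVD-based fit, used here as a stand-in for a
    full subspace identification method.
    """
    Yc = Y - Y.mean(axis=1, keepdims=True)        # remove the temporal mean
    U, s, Vt = np.linalg.svd(Yc, full_matrices=False)
    C = U[:, :n]                                  # output matrix, orthonormal columns
    X = np.diag(s[:n]) @ Vt[:n, :]                # estimated state sequence, (n, T)
    A = X[:, 1:] @ np.linalg.pinv(X[:, :-1])      # least-squares state transition
    return A, C

def observability(A, C, m):
    """Stack C, CA, ..., CA^(m-1) into an (m*p, n) matrix."""
    blocks = [C]
    for _ in range(m - 1):
        blocks.append(blocks[-1] @ A)
    return np.vstack(blocks)

def martin_distance_sq(sys1, sys2, m=20):
    """Finite-horizon approximation of the squared Martin distance.

    Uses the principal angles between the column spaces of the two
    truncated observability matrices: d^2 = -2 * sum(log(cos(theta_i))).
    """
    Q1, _ = np.linalg.qr(observability(*sys1, m))
    Q2, _ = np.linalg.qr(observability(*sys2, m))
    cosines = np.linalg.svd(Q1.T @ Q2, compute_uv=False)
    cosines = np.clip(cosines, 1e-12, 1.0)        # guard against log(0) and rounding
    return -2.0 * np.sum(np.log(cosines))

def nearest_neighbor(query, labeled_systems, m=20):
    """Return the label of the training system closest to the query."""
    return min(labeled_systems,
               key=lambda item: martin_distance_sq(query, item[0], m))[1]
```

A sequence would be classified by fitting `(A, C)` to its feature trajectory with `fit_lds` and passing the result to `nearest_neighbor` together with a list of `((A, C), label)` pairs fitted on training sequences. Because the distance depends only on the column space of the observability matrix, it does not change under a change of basis of the state.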
