Temporal Synchronization and Normalization of Speech Videos for Face Recognition

Automatic Face Recognition (AFR) offers several advantages over other biometrics, such as acceptability and ease of use, but its identification rates remain lower than those of more traditional biometrics such as fingerprints. Image-based face recognition was the mainstay of AFR for several decades, but it quickly gave way to video-based AFR with the arrival of inexpensive video cameras and enhanced processing power. Video-based face recognition has several advantages over image-based techniques, the two main ones being more data for pixel-based techniques and the availability of temporal information. These advantages come with inconveniences, the foremost being increased variation. In classical image-based face recognition, degraded performance has mostly been attributed to three main sources of variation in the human face: pose, illumination, and expression.

Among these, pose has been particularly problematic, both in its effect on recognition results and in the difficulty of compensating for it. Techniques for handling pose in face recognition can be classified into three categories: those that estimate an explicit 3D model of the face (Blanz & Vetter, 2003) and then use the model parameters for pose compensation; subspace-based methods such as eigenspaces (Matta & Dugelay, 2008); and methods that build a separate subspace for each pose of the face, such as view-based eigenspaces (Lee & Kriegman, 2005).

Managing illumination variation in videos has been studied relatively less than pose; mostly, image-based techniques are extended to video. The two classical image-based techniques that have been extended to video with relative success are illumination cones (Georghiades et al., 1998) and 3D morphable models (Blanz & Vetter, 2003).

Lastly, expression-invariant face recognition techniques can be divided into two categories: subspace methods that model facial deformations (Tsai et al., 2007), and morphing-based techniques (Ramachandran et al., 2005), which, for example, morph a smiling face into a neutral one.
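To make the subspace-based pose-handling categories concrete, the following is a minimal sketch of the view-based eigenspace idea: a separate PCA subspace (eigenfaces) is fitted for each pose bin, and a probe image is assigned to the pose whose subspace reconstructs it with the lowest error. This is an illustrative NumPy sketch, not the implementation of any cited method; the function names, the pose bins, and the random stand-in data are all hypothetical.

```python
import numpy as np

def fit_eigenspace(faces, n_components):
    """Fit a PCA subspace (eigenfaces) to vectorized face images.

    faces: (n_samples, n_pixels) array, one flattened image per row.
    Returns the mean face and the top n_components principal axes.
    """
    mean = faces.mean(axis=0)
    centered = faces - mean
    # SVD of the centered data matrix; rows of vt are the principal axes.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]

def reconstruction_error(face, mean, components):
    """Distance from a probe face to a subspace ("distance from face space")."""
    centered = face - mean
    coeffs = components @ centered        # project onto the eigenspace
    recon = components.T @ coeffs         # back-project into image space
    return np.linalg.norm(centered - recon)

def best_view(probe, view_models):
    """view_models maps a pose label to its (mean, components) eigenspace.

    The probe is matched to the pose whose subspace reconstructs it best.
    """
    return min(view_models,
               key=lambda pose: reconstruction_error(probe, *view_models[pose]))

# Hypothetical usage with random stand-in data: 100 samples per pose bin,
# 32x32-pixel images flattened to 1024-dimensional vectors.
rng = np.random.default_rng(0)
views = {pose: fit_eigenspace(rng.standard_normal((100, 1024)), 20)
         for pose in ("frontal", "profile")}
probe = rng.standard_normal(1024)
print(best_view(probe, views))
```

In a full view-based recognizer, a set of such models would be kept per enrolled subject (or the per-pose projection coefficients fed to a classifier); the single-eigenspace approach of Turk and Pentland corresponds to the special case of one model covering all poses.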

[1] M. Kass, A. Witkin, and D. Terzopoulos, "Snakes: Active contour models," International Journal of Computer Vision, 1988.

[2] Y. Tonomura et al., "Video tomography: an efficient method for camerawork extraction and motion analysis," Proceedings of ACM Multimedia '94, 1994.

[3] A. S. Georghiades, D. J. Kriegman, and P. N. Belhumeur, "Illumination cones for recognition under variable lighting: faces," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1998.

[4] Y. Guan et al., "Automatic extraction of lips based on multi-scale wavelet edge detection," 2008.

[5] C.-L. Huang et al., "Facial Expression Recognition Using Model-Based Feature Extraction and Action Parameters Classification," Journal of Visual Communication and Image Representation, 1997.

[6] M. Ramachandran et al., "A method for converting a smiling face to a neutral face with applications to face recognition," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2005.

[7] T. F. Chan and L. A. Vese, "Active contours without edges," IEEE Transactions on Image Processing, 2001.

[8] U. Canzler et al., "Extraction of Non Manual Features for Videobased Sign Language Recognition," Proceedings of the IAPR Workshop on Machine Vision Applications (MVA), 2002.

[9] A. W.-C. Liew et al., "Segmentation of color lip images by spatial fuzzy clustering," IEEE Transactions on Fuzzy Systems, 2003.

[10] C. C. Chibelushi et al., "Robust Facial Feature Tracking," Proceedings of the British Machine Vision Conference (BMVC), 2000.

[11] K.-C. Lee and D. J. Kriegman, "Online learning of probabilistic appearance manifolds for video-based recognition and tracking," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2005.

[12] M. Turk and A. Pentland, "Eigenfaces for Recognition," Journal of Cognitive Neuroscience, 1991.

[13] K. Sugiyama et al., "Motion compensated frame rate conversion using normalized motion estimation," IEEE Workshop on Signal Processing Systems Design and Implementation, 2005.

[14] S. Sengupta et al., "Lip Localization and Viseme Recognition from Video Sequences," 2007.

[15] P. Tsai et al., "Kernel-based Subspace Analysis for Face Recognition," Proceedings of the International Joint Conference on Neural Networks (IJCNN), 2007.

[16] V. Blanz and T. Vetter, "Face Recognition Based on Fitting a 3D Morphable Model," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2003.

[17] J. Canny, "A Computational Approach to Edge Detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, 1986.

[18] G. Wolberg, "Recent advances in image morphing," Proceedings of CG International '96, 1996.

[19] F. Matta and J.-L. Dugelay, "Tomofaces: Eigenfaces extended to videos of speakers," Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2008.