Estimating head pose and state of facial elements for sign language video

There is an increasing need for automatic video analysis and annotation tools to support linguists in their studies of sign language. So far, the number of studies focusing on automatic annotation of non-manual gestures in sign language videos has been limited (e.g., Metaxas et al., 2012). In this work we therefore study methods for automatic estimation of three head pose angles and of the state of facial elements (eyebrow position, eye openness, and mouth state). The proposed estimation methods are incorporated in our publicly available SLMotion software package (Karppa et al., 2014) for sign language video processing and analysis.

Head pose can be described by three angles: yaw, pitch, and roll (Figure 1a). We propose an approach for head pose estimation from images based on two kinds of visual features. The first group of features consists of the facial landmark points extracted with the flandmark software library (Uřičář et al., 2012). Secondly, as novel additional features, we use a tonal segmentation mask of skin-like colors within the face bounding box. The roll angle is estimated with a geometric approach based on the locations of the eye landmarks. The yaw and pitch angles are estimated with separate Support Vector Regression (Smola and Schölkopf, 2004) models. The regressors are trained on the full set of features extracted from a subset of 684 annotated images in the Pointing04 database (Gourier et al., 2004), which shows subjects in various head poses. In this subset, the pose varies from −45° to +45° in yaw and from −30° to +30° in pitch, in 15° steps.

The performance of the proposed head pose estimation method was evaluated in two experimental settings. In the first series of experiments, a separate subset of the Pointing04 data was used to measure the accuracy of the trained yaw and pitch regressors. Our accuracy levels compare favorably with published results on the same data, although part of the observed advantage may stem from the use of only near-frontal pose angles in our evaluation. Roll estimation was not evaluated, as the Pointing04 database does not include reference annotations for roll. In the second experiment, continuous signing was recorded simultaneously with a video camera and motion capture equipment (Figure 2). The pose angles were estimated from the video frames and compared with the ground-truth values obtained from the motion capture recording. The estimated yaw and roll angles correlated very highly with the ground truth, whereas the correlation for pitch was slightly lower. We conclude that our video-based estimates compare well with motion capture measurements.
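To make the two head pose estimation steps concrete, the sketch below shows one way the geometric roll computation and the yaw and pitch regressors could be combined. It is an illustrative outline only: scikit-learn's SVR stands in for the actual regression implementation, and the feature matrix, kernel choice, and parameter values are placeholder assumptions rather than the values used in our system.

```python
# Minimal sketch, assuming scikit-learn as a stand-in for the SVR implementation
# and placeholder feature vectors in place of the flandmark + skin-mask features.
import numpy as np
from sklearn.svm import SVR

def roll_from_eyes(left_eye, right_eye):
    """Geometric roll estimate: angle of the line joining the two eye centres.

    left_eye, right_eye: (x, y) landmark coordinates in image space.
    Returns the roll angle in degrees (0 when the eyes are level).
    """
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return np.degrees(np.arctan2(dy, dx))

# Hypothetical training data: one row of concatenated features per training image
# (landmark coordinates plus skin-mask descriptors) with annotated pose angles.
rng = np.random.default_rng(0)
X_train = rng.random((684, 40))            # placeholder feature matrix
yaw_train = rng.uniform(-45, 45, 684)      # placeholder yaw labels (degrees)
pitch_train = rng.uniform(-30, 30, 684)    # placeholder pitch labels (degrees)

# Separate regressors for yaw and pitch; the RBF kernel and C value are
# assumptions for this sketch, not the parameters of the actual system.
yaw_svr = SVR(kernel="rbf", C=10.0).fit(X_train, yaw_train)
pitch_svr = SVR(kernel="rbf", C=10.0).fit(X_train, pitch_train)

def estimate_pose(features, left_eye, right_eye):
    """Return (yaw, pitch, roll) estimates for a single video frame."""
    yaw = yaw_svr.predict(features.reshape(1, -1))[0]
    pitch = pitch_svr.predict(features.reshape(1, -1))[0]
    roll = roll_from_eyes(left_eye, right_eye)
    return yaw, pitch, roll
```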
The method we propose for estimating eyebrow position, eye openness, and mouth state is based on the construction of an extended set of facial landmarks (Figure 1b). These landmarks locate the eyebrows, the eyelids, and the upper and lower lip boundaries, which are not part of the flandmark output. The detection of the extended landmarks is initialized with the flandmark set, which provides coarse locations for the facial elements. The proposed detection method then employs an ensemble of techniques for each facial element: oriented projections and pixel similarity for the eyebrows, oriented projections and a radial transform for the eyes, and a pseudo-hue mask for the lips. We also consider landmarks detected with the recently presented appearance-based Supervised Descent Method implemented in the IntraFace software package (Human Sensing Laboratory and Affect Analysis Group, 2013). The extended landmarks are used to calculate a set of geometric features, which are further post-processed with Principal Component Analysis. The processed features serve as input to statistical classifiers trained to produce quantized estimates of eyebrow position, eye openness, and mouth state (Table 1).

We evaluated the performance of the facial element state estimators in both quantitative and qualitative experiments. For the quantitative experiments, we manually annotated the facial states in videos taken from the SUVI dictionary of Finnish Sign Language (Suvi, 2003). The annotation was performed frame by frame on the basis of the visual appearance of each isolated frame, without regard to the linguistic significance of the facial states. We compared the automatic estimates against the manual annotations using the Matthews correlation coefficient (Powers, 2011). In these experiments (Figure 3), the eye openness and vertical mouth state estimators correlated highly with the ground truth, whereas the eyebrow position and horizontal mouth state estimators were rather noisy, often missing subtle changes and being sensitive to small variations in head pose. For the qualitative experiments, we used annotations of linguistic significance prepared for a subset of the SUVI material in earlier work (Jantunen, 2007). In the evaluation example (Figure 3), the automatic estimates correctly detected blinks and squints with only a few mislabeled frames. Furthermore, the eyebrow estimates coincided with the linguistic annotations except in cases of non-linguistic visual change or perspective illusion (e.g., head tilting).

In summary, we have proposed methods for automatic estimation of two types of non-manual elements in sign language video: head pose and the state of facial elements. The experimental results demonstrate promising progress on these two separate fronts towards automatic annotation of non-manuals in sign language.
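As a concrete illustration of the quantitative evaluation procedure described above, the short sketch below computes the Matthews correlation coefficient between hypothetical per-frame manual annotations and automatic estimates for a single facial element; scikit-learn's implementation of the coefficient and the three-level eye-openness coding are assumptions made for the example, not the exact annotation scheme used in our experiments.

```python
# Minimal sketch of the frame-level evaluation, assuming scikit-learn's
# implementation of the Matthews correlation coefficient and an invented
# three-level eye-openness coding (0 = closed, 1 = narrowed, 2 = open).
from sklearn.metrics import matthews_corrcoef

# Hypothetical per-frame labels: manual annotations vs. automatic estimates.
manual    = [2, 2, 2, 0, 0, 2, 2, 1, 1, 2]
estimated = [2, 2, 0, 0, 0, 2, 2, 1, 2, 2]

mcc = matthews_corrcoef(manual, estimated)
print(f"Matthews correlation coefficient: {mcc:.3f}")
```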

[1] Tommi Jantunen, et al. The equative sentence in Finnish Sign Language, 2007.

[2] Chih-Jen Lin, et al. LIBSVM: A library for support vector machines, 2011, TIST.

[3] Jorma Laaksonen, et al. SLMotion - An extensible sign language oriented video analysis tool, 2014, LREC.

[4] Juan Bernardo Gómez-Mendoza, et al. A contribution to mouth structure segmentation in images towards automatic mouth gesture recognition, 2012.

[5] Antoine Picot, et al. Using retina modelling to characterize blinking: comparison between EOG and video analysis, 2012, Machine Vision and Applications.

[6] Zdeněk Krňoul, et al. Automatic Fingersign to Speech Translator, 2010.

[7] Roland Pfau, et al. Nonmanuals: their grammatical and prosodic roles, 2010.

[8] Erhardt Barth, et al. Accurate Eye Centre Localisation by Means of Gradients, 2011, VISAPP.

[9] Fernando De la Torre, et al. Supervised Descent Method and Its Applications to Face Alignment, 2013, IEEE Conference on Computer Vision and Pattern Recognition.

[10] Václav Hlaváč, et al. Detector of Facial Landmarks Learned by the Structured Output SVM, 2012, VISAPP.

[11] Mohan M. Trivedi, et al. Head Pose Estimation in Computer Vision: A Survey, 2009, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[12] Qiang Ji, et al. In the Eye of the Beholder: A Survey of Models for Eyes and Gaze, 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[13] Bernhard Schölkopf, et al. A tutorial on support vector regression, 2004, Statistics and Computing.

[14] Hermann Ney, et al. SignSpeak - understanding, recognition, and translation of sign languages, 2010.

[15] Bernt Schiele, et al. Comprehensive Colour Image Normalization, 1998, ECCV.

[16] Fei Yang, et al. Recognizing eyebrow and periodic head gestures using CRFs for non-manual grammatical marker detection in ASL, 2013, 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG).

[17] Fei Yang, et al. Recognition of Nonmanual Markers in American Sign Language (ASL) Using Non-Parametric Adaptive 2D-3D Face Tracking, 2012, LREC.

[18] Benzai Deng, et al. Facial Expression Recognition using AAM and Local Facial Features, 2007, Third International Conference on Natural Computation (ICNC).

[19] J. Crowley, et al. Estimating Face Orientation from Robust Detection of Salient Facial Structures, 2004.

[20] Alice Caplier, et al. Lip contour segmentation and tracking compliant with lip-reading application constraints, 2012, Machine Vision and Applications.

[21] David M. W. Powers, et al. Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation, 2011, ArXiv.

[22] Javier R. Movellan, et al. A discriminative approach to frame-by-frame head pose tracking, 2008, 8th IEEE International Conference on Automatic Face & Gesture Recognition.

[23] Franck Luthon, et al. Nonlinear color space and spatiotemporal MRF for hierarchical segmentation of face features in video, 2004, IEEE Transactions on Image Processing.

[24] David G. Lowe, et al. Object recognition from local scale-invariant features, 1999, Proceedings of the Seventh IEEE International Conference on Computer Vision.

[25] Zia-ur Rahman, et al. Properties and performance of a center/surround retinex, 1997, IEEE Transactions on Image Processing.