Calibrating Head Pose Estimation in Videos for Meeting Room Event Analysis

In this paper, we study the calibration of head pose estimation in a stereo camera setting for meeting room video event analysis. Head pose indicates the direction of a subject's attention and is therefore valuable for video event analysis and indexing, especially in the meeting room scenario. We are developing a multi-modal meeting room analysis system for studying meeting interaction dynamics, in which head pose estimation is a key component. Since each subject in the meeting room is observed by a pair of stereo cameras, we perform 2D head tracking for the subject in each camera and obtain the 3D coordinates of the head by triangulation. The 3D head pose is estimated in one of the camera coordinate systems, and we develop a procedure to accurately convert the estimated 3D pose from that camera coordinate system to the world coordinate system. In the experiments, visualization of the estimated head pose and location in the world coordinate system verifies the soundness of our design. The estimated head pose and 3D location of the subjects in the meeting room allow further analysis of meeting interaction dynamics, such as F-formation and floor control.
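The abstract does not spell out the triangulation or the camera-to-world conversion, so the following Python sketch illustrates one standard way these two steps could be carried out under stated assumptions: the tracked 2D head centers from the stereo pair are triangulated with a linear DLT, and the head rotation estimated in a camera frame is mapped to the world frame using that camera's extrinsic rotation. All function and variable names are illustrative, and the extrinsics are assumed to map world coordinates to camera coordinates (X_cam = R_cam X_world + t_cam); this is a sketch, not the authors' implementation.

```python
import numpy as np

def triangulate_point(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one 3D point from two views.

    P1, P2 : 3x4 projection matrices K @ [R | t] of the stereo pair,
             with extrinsics expressed relative to the world frame.
    x1, x2 : (u, v) pixel coordinates of the tracked head center in each view.
    Returns the 3D head location in world coordinates.
    """
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # Homogeneous solution: right singular vector of the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]

def head_pose_to_world(R_head_cam, R_cam):
    """Convert a head rotation estimated in a camera frame to the world frame.

    R_head_cam : 3x3 head orientation expressed in the camera coordinate system.
    R_cam      : 3x3 extrinsic rotation mapping world to camera coordinates,
                 so the camera-to-world rotation is its transpose.
    """
    return R_cam.T @ R_head_cam

# Hypothetical usage with calibrated projection matrices P1, P2 and tracked
# head centers (u1, v1), (u2, v2):
#   head_xyz = triangulate_point(P1, P2, (u1, v1), (u2, v2))
#   R_world  = head_pose_to_world(R_head_cam, R_cam1)
```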
