Audiovisual Head Orientation Estimation with Particle Filtering in Multisensor Scenarios

This article presents a multimodal approach to estimating the head pose of individuals in environments equipped with multiple cameras and microphones, such as SmartRooms or automatic video conferencing systems. Determining an individual's head orientation is the basis for many forms of more sophisticated interaction between humans and technical devices, and it can also be used for automatic sensor selection (camera, microphone) in communications or video surveillance systems. We propose particle filters as a unified framework for estimating head orientation in both the monomodal and multimodal cases. In video, head orientation is estimated from color information by exploiting spatial redundancy among cameras. Audio information is processed to estimate the direction of the voice produced by a speaker, making use of the directivity characteristics of the head radiation pattern. Furthermore, two particle filter schemes for fusing the audio and video streams are analyzed in terms of accuracy and robustness. In the first, fusion is performed at the decision level by combining the monomodal head pose estimates; the second uses a joint estimation system that combines information at the data level. Experimental results on the CLEAR 2006 evaluation database are reported, and a comparison of the proposed multimodal head pose estimation algorithms with the reference monomodal approaches demonstrates the effectiveness of the multimodal approach.
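The difference between the two fusion schemes can be illustrated with a minimal sketch. It is not the article's implementation: the color-based video likelihood and the audio likelihood derived from the head radiation pattern are replaced here by an assumed von Mises-shaped likelihood on noisy synthetic pan-angle measurements, and all names, noise levels, and weights (`pf_step`, `fuse_decisions`, `kappa`, the 0.8/0.2 weighting) are hypothetical choices made for illustration. Decision-level fusion runs one particle filter per modality and combines their estimates; data-level fusion runs a single filter whose particle weights are the product of both likelihoods.

```python
import math
import random


def circular_mean(angles, weights):
    """Weighted mean of angles (radians) that respects wrap-around."""
    s = sum(w * math.sin(a) for a, w in zip(angles, weights))
    c = sum(w * math.cos(a) for a, w in zip(angles, weights))
    return math.atan2(s, c)


def ang_err(a, b):
    """Absolute angular difference wrapped to [0, pi]."""
    return abs(math.atan2(math.sin(a - b), math.cos(a - b)))


def likelihood(theta, z, kappa):
    """Assumed unnormalised von Mises-shaped likelihood of measurement z."""
    return math.exp(kappa * (math.cos(theta - z) - 1.0))


def pf_step(particles, measurements, rng, sigma=0.1):
    """One predict/update/resample step.

    measurements is a list of (z, kappa) pairs; passing more than one pair
    multiplies their likelihoods, which is the data-level (joint) fusion.
    """
    # Predict: random-walk dynamics on the pan angle (an assumption).
    predicted = [p + rng.gauss(0.0, sigma) for p in particles]
    # Update: weight each particle by the product of the likelihoods.
    weights = []
    for p in predicted:
        w = 1.0
        for z, kappa in measurements:
            w *= likelihood(p, z, kappa)
        weights.append(w)
    total = sum(weights)
    if total == 0.0:
        weights = [1.0 / len(predicted)] * len(predicted)
    else:
        weights = [w / total for w in weights]
    estimate = circular_mean(predicted, weights)
    # Resample: draw particles in proportion to their weights.
    resampled = rng.choices(predicted, weights=weights, k=len(predicted))
    return resampled, estimate


def fuse_decisions(est_video, est_audio, w_video=0.5, w_audio=0.5):
    """Decision-level fusion: weighted circular mean of monomodal estimates."""
    return circular_mean([est_video, est_audio], [w_video, w_audio])


def run_demo(seed=0, steps=30, n=500):
    rng = random.Random(seed)
    true_theta = math.radians(40.0)  # synthetic ground-truth pan angle

    def init():
        return [rng.uniform(-math.pi, math.pi) for _ in range(n)]

    p_video, p_audio, p_joint = init(), init(), init()
    est_decision = est_joint = 0.0
    for _ in range(steps):
        # Synthetic measurements; audio is assumed noisier than video.
        z_video = true_theta + rng.gauss(0.0, 0.15)
        z_audio = true_theta + rng.gauss(0.0, 0.30)
        p_video, e_video = pf_step(p_video, [(z_video, 20.0)], rng)
        p_audio, e_audio = pf_step(p_audio, [(z_audio, 5.0)], rng)
        p_joint, est_joint = pf_step(
            p_joint, [(z_video, 20.0), (z_audio, 5.0)], rng
        )
        est_decision = fuse_decisions(e_video, e_audio, 0.8, 0.2)
    return true_theta, est_decision, est_joint


if __name__ == "__main__":
    truth, decision, joint = run_demo()
    print("truth (deg):         ", round(math.degrees(truth), 1))
    print("decision-level (deg):", round(math.degrees(decision), 1))
    print("data-level (deg):    ", round(math.degrees(joint), 1))
```

In this sketch the only structural difference between the two schemes is where the modalities meet: after estimation in `fuse_decisions`, or inside the weight update of `pf_step`. The relative accuracy of the two on real data is what the article evaluates; the synthetic run above says nothing about it.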
