Audio-visual face detection for tracking in a meeting room environment

A key task in many applications such as tracking or face recognition is the detection and localisation of a subject's face in an image. This can still prove to be a challenging task particularly in low resolution or noisy images. Here we propose a robust method for face detection using both audio and visual information. We construct a dictionary learning based face detector using a set of distinctive and robust image features. We then train a support vector machine classifier using sparse image representations produced by this dictionary to classify face versus background. This is combined with the azimuth angle of the speaker produced by an audio localisation system to constrain the search space for the subject's face. This increases the efficiency of the detection and localisation process by limiting the search area. However, more importantly, the audio information allows us to know a priori the number of subjects in the image. This greatly reduces the possibility of false positive face detections. We demonstrate the advantage of this proposed approach over traditional face detection methods on the challenging AV16.3 dataset.

[1]  Paul A. Viola,et al.  Robust Real-Time Face Detection , 2001, International Journal of Computer Vision.

[2]  Larry S. Davis,et al.  Joint Audio-Visual Tracking Using Particle Filters , 2002, EURASIP J. Adv. Signal Process..

[3]  Andrew Zisserman,et al.  Video Google: a text retrieval approach to object matching in videos , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[4]  Yang Wang,et al.  Robust facial feature tracking under varying face pose and facial expression , 2007, Pattern Recognit..

[5]  Baoxin Li,et al.  Discriminative K-SVD for dictionary learning in face recognition , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[6]  Jake K. Aggarwal,et al.  Tracking human motion using multiple cameras , 1996, Proceedings of 13th International Conference on Pattern Recognition.

[7]  Miao Yu,et al.  A Multimodal Approach to Blind Source Separation of Moving Sources , 2010, IEEE Journal of Selected Topics in Signal Processing.

[8]  Takeo Kanade,et al.  Neural Network-Based Face Detection , 1998, IEEE Trans. Pattern Anal. Mach. Intell..

[9]  Julien Bourgeois,et al.  Sector-Based Detection for Hands-Free Speech Enhancement in Cars , 2006, EURASIP J. Adv. Signal Process..

[10]  Lei Wang,et al.  In defense of soft-assignment coding , 2011, 2011 International Conference on Computer Vision.

[11]  Paul A. Viola,et al.  Boosting-Based Multimodal Speaker Detection for Distributed Meeting Videos , 2008, IEEE Transactions on Multimedia.

[12]  Cordelia Schmid,et al.  Comparison of affine-invariant local detectors and descriptors , 2004, 2004 12th European Signal Processing Conference.

[13]  Zhengyou Zhang,et al.  A Survey of Recent Advances in Face Detection , 2010 .

[14]  Josef Kittler,et al.  The University of Surrey Visual Concept Detection System at ImageCLEF@ICPR: Working Notes , 2010, ICPR.

[15]  Josef Kittler,et al.  The University of Surrey Visual Concept Detection System at ImageCLEF@ICPR: Working Notes , 2010, 2010 20th International Conference on Pattern Recognition.

[16]  Cor J. Veenman,et al.  Visual Word Ambiguity , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[17]  Yihong Gong,et al.  Locality-constrained Linear Coding for image classification , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[18]  Wang Chuan-xu,et al.  A New Face Tracking Algorithm Based on Local Binary Pattern and Skin Color Information , 2008 .

[19]  Josef Kittler,et al.  Visual category recognition using Spectral Regression and Kernel Discriminant Analysis , 2009, 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops.

[20]  Zuoyong Li,et al.  A New Face Tracking Algorithm Based on Local Binary Pattern and Skin Color Information , 2008, 2008 International Symposium on Computer Science and Computational Technology.

[21]  Rama Chellappa,et al.  Illumination robust dictionary-based face recognition , 2011, 2011 18th IEEE International Conference on Image Processing.

[22]  Huiyu Zhou,et al.  Object tracking using SIFT features and mean shift , 2009, Comput. Vis. Image Underst..

[23]  Rama Chellappa,et al.  Dictionary-Based Face Recognition Under Variable Lighting and Pose , 2012, IEEE Transactions on Information Forensics and Security.

[24]  David Zhang,et al.  Robust Object Tracking Using Joint Color-Texture Histogram , 2009, Int. J. Pattern Recognit. Artif. Intell..

[25]  Krystian Mikolajczyk,et al.  Comparison of mid-level feature coding approaches and pooling strategies in visual concept detection , 2013, Comput. Vis. Image Underst..

[26]  Tieniu Tan,et al.  Real time hand tracking by combining particle filtering and mean shift , 2004, Sixth IEEE International Conference on Automatic Face and Gesture Recognition, 2004. Proceedings..

[27]  G LoweDavid,et al.  Distinctive Image Features from Scale-Invariant Keypoints , 2004 .

[28]  Narendra Ahuja,et al.  Detecting Faces in Images: A Survey , 2002, IEEE Trans. Pattern Anal. Mach. Intell..

[29]  Jun Zhang,et al.  A face tracking algorithm based on LBP histograms and particle filtering , 2010, 2010 Sixth International Conference on Natural Computation.

[30]  Krystian Mikolajczyk,et al.  Soft assignment of visual words as Linear Coordinate Coding and optimisation of its reconstruction error , 2011, 2011 18th IEEE International Conference on Image Processing.

[31]  Guillaume Lathoud,et al.  A sector-based, frequency-domain approach to detection and localization of multiple speakers , 2005, Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005..