Robust Multi-Speaker Tracking via Dictionary Learning and Identity Modeling

We investigate the problem of visual tracking of multiple human speakers in an office environment. In particular, we propose novel solutions to the following challenges: (1) robust and computationally efficient modeling and classification of the changing appearance of the speakers in a variety of different lighting conditions and camera resolutions; (2) dealing with full or partial occlusions when multiple speakers cross or come into very close proximity; (3) automatic initialization of the trackers, or re-initialization when the trackers have lost lock caused by e.g. the limited camera views. First, we develop new algorithms for appearance modeling of the moving speakers based on dictionary learning (DL), using an off-line training process. In the tracking phase, the histograms (coding coefficients) of the image patches derived from the learned dictionaries are used to generate the likelihood functions based on Support Vector Machine (SVM) classification. This likelihood function is then used in the measurement step of the classical particle filtering (PF) algorithm. To improve the computational efficiency of generating the histograms, a soft voting technique based on approximate Locality-constrained Soft Assignment (LcSA) is proposed to reduce the number of dictionary atoms (codewords) used for histogram encoding. Second, an adaptive identity model is proposed to track multiple speakers whilst dealing with occlusions. This model is updated online using Maximum a Posteriori (MAP) adaptation, where we control the adaptation rate using the spatial relationship between the subjects. Third, to enable automatic initialization of the visual trackers, we exploit audio information, the Direction of Arrival (DOA) angle, derived from microphone array recordings. Such information provides, a priori, the number of speakers and constrains the search space for the speaker's faces. The proposed system is tested on a number of sequences from three publicly available and challenging data corpora (AV16.3, EPFL pedestrian data set and CLEAR) with up to five moving subjects.

[1]  Larry S. Davis,et al.  Joint Audio-Visual Tracking Using Particle Filters , 2002, EURASIP J. Adv. Signal Process..

[2]  Andrew Zisserman,et al.  Video Google: a text retrieval approach to object matching in videos , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[3]  Rainer Stiefelhagen,et al.  Audio-visual multi-person tracking and identification for smart environments , 2007, ACM Multimedia.

[4]  Jun Zhang,et al.  A face tracking algorithm based on LBP histograms and particle filtering , 2010, 2010 Sixth International Conference on Natural Computation.

[5]  Yang Wang,et al.  Robust facial feature tracking under varying face pose and facial expression , 2007, Pattern Recognit..

[6]  Pascal Fua,et al.  Ieee Transactions on Pattern Analysis and Machine Intelligence 1 Multiple Object Tracking Using K-shortest Paths Optimization , 2022 .

[7]  Dorin Comaniciu,et al.  Real-time tracking of non-rigid objects using mean shift , 2000, Proceedings IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No.PR00662).

[8]  Ehud Rivlin,et al.  Robust Fragments-based Tracking using the Integral Histogram , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[9]  Ming-Hsuan Yang,et al.  Incremental Learning for Robust Visual Tracking , 2008, International Journal of Computer Vision.

[10]  Lei Wang,et al.  In defense of soft-assignment coding , 2011, 2011 International Conference on Computer Vision.

[11]  Samy Bengio,et al.  A comparative study of adaptation methods for speaker verification , 2002, INTERSPEECH.

[12]  Javed Ahmed,et al.  Robust Edge-Enhanced Fragment Based Normalized Correlation Tracking in Cluttered and Occluded Imagery , 2009, FGIT-SIP.

[13]  Josef Kittler,et al.  The University of Surrey Visual Concept Detection System at ImageCLEF@ICPR: Working Notes , 2010, ICPR.

[14]  Michael S. Brandstein,et al.  Robust Localization in Reverberant Rooms , 2001, Microphone Arrays.

[15]  Huiyu Zhou,et al.  Object tracking using SIFT features and mean shift , 2009, Comput. Vis. Image Underst..

[16]  Josef Kittler,et al.  The University of Surrey Visual Concept Detection System at ImageCLEF@ICPR: Working Notes , 2010, 2010 20th International Conference on Pattern Recognition.

[17]  Yihong Gong,et al.  Nonlinear Learning using Local Coordinate Coding , 2009, NIPS.

[18]  Paul A. Viola,et al.  Robust Real-Time Face Detection , 2001, Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001.

[19]  Ming-Hsuan Yang,et al.  Visual tracking with online Multiple Instance Learning , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[20]  David Zhang,et al.  Robust Object Tracking Using Joint Color-Texture Histogram , 2009, Int. J. Pattern Recognit. Artif. Intell..

[21]  Chin-Hui Lee,et al.  Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains , 1994, IEEE Trans. Speech Audio Process..

[22]  Nello Cristianini,et al.  An introduction to Support Vector Machines , 2000 .

[23]  Krystian Mikolajczyk,et al.  Soft assignment of visual words as Linear Coordinate Coding and optimisation of its reconstruction error , 2011, 2011 18th IEEE International Conference on Image Processing.

[24]  Yihong Gong,et al.  Linear spatial pyramid matching using sparse coding for image classification , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[25]  Luc Van Gool,et al.  An adaptive color-based particle filter , 2003, Image Vis. Comput..

[26]  Krystian Mikolajczyk,et al.  Comparison of mid-level feature coding approaches and pooling strategies in visual concept detection , 2013, Comput. Vis. Image Underst..

[27]  Jake K. Aggarwal,et al.  Tracking human motion using multiple cameras , 1996, Proceedings of 13th International Conference on Pattern Recognition.

[28]  Yuan Li,et al.  Tracking in Low Frame Rate Video: A Cascade Particle Filter with Discriminative Observers of Different Lifespans , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[29]  Zuoyong Li,et al.  A New Face Tracking Algorithm Based on Local Binary Pattern and Skin Color Information , 2008, 2008 International Symposium on Computer Science and Computational Technology.

[30]  Andrew W. Moore,et al.  An Investigation of Practical Approximate Nearest Neighbor Algorithms , 2004, NIPS.

[31]  Takeo Kanade,et al.  Neural Network-Based Face Detection , 1998, IEEE Trans. Pattern Anal. Mach. Intell..

[32]  Richard O. Duda,et al.  Pattern classification and scene analysis , 1974, A Wiley-Interscience publication.

[33]  Junzhou Huang,et al.  Robust tracking using local sparse appearance model and K-selection , 2011, CVPR 2011.

[34]  Jun S. Liu,et al.  Sequential Imputations and Bayesian Missing Data Problems , 1994 .

[35]  Guillaume Lathoud,et al.  A sector-based, frequency-domain approach to detection and localization of multiple speakers , 2005, Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005..

[36]  Branko Ristic,et al.  A particle filter for joint detection and tracking of color objects , 2007, Image Vis. Comput..

[37]  Rajat Raina,et al.  Efficient sparse coding algorithms , 2006, NIPS.

[38]  Cor J. Veenman,et al.  Visual Word Ambiguity , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[39]  Yihong Gong,et al.  Locality-constrained Linear Coding for image classification , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[40]  Vincent Lepetit,et al.  Feature Harvesting for Tracking-by-Detection , 2006, ECCV.

[41]  Sim Heng Ong,et al.  Probability Hypothesis Density Approach for Multi-camera Multi-object Tracking , 2007, ACCV.

[42]  Jean-Marc Odobez,et al.  AV16.3: An Audio-Visual Corpus for Speaker Localization and Tracking , 2004, MLMI.

[43]  Wen Gao,et al.  Online selecting discriminative tracking features using particle filter , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[44]  Josef Kittler,et al.  Visual category recognition using Spectral Regression and Kernel Discriminant Analysis , 2009, 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops.

[45]  Matthijs C. Dorst Distinctive Image Features from Scale-Invariant Keypoints , 2011 .

[46]  Julien Bourgeois,et al.  Sector-Based Detection for Hands-Free Speech Enhancement in Cars , 2006, EURASIP J. Adv. Signal Process..

[47]  Thomas S. Huang,et al.  Image Classification Using Super-Vector Coding of Local Image Descriptors , 2010, ECCV.

[48]  Cordelia Schmid,et al.  Comparison of affine-invariant local detectors and descriptors , 2004, 2004 12th European Signal Processing Conference.

[49]  Liang-Tien Chia,et al.  Local features are not lonely – Laplacian sparse coding for image classification , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[50]  Jeff A. Bilmes,et al.  A gentle tutorial of the em algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models , 1998 .

[51]  Miao Yu,et al.  A Multimodal Approach to Blind Source Separation of Moving Sources , 2010, IEEE Journal of Selected Topics in Signal Processing.

[52]  Shaogang Gong,et al.  Tracking Multiple People Under Occlusion Using Multiple Cameras , 2000, BMVC.

[53]  I. McCowan,et al.  PROBABILISTIC TRACKING OF MULTIPLE SPEAKERS IN MEETINGS , 2007 .

[54]  Tieniu Tan,et al.  Real time hand tracking by combining particle filtering and mean shift , 2004, Sixth IEEE International Conference on Automatic Face and Gesture Recognition, 2004. Proceedings..