Mean-Shift and Sparse Sampling-Based SMC-PHD Filtering for Audio Informed Visual Speaker Tracking

The probability hypothesis density (PHD) filter based on sequential Monte Carlo (SMC) approximation (also known as the SMC-PHD filter) has proven to be a promising algorithm for multispeaker tracking. However, it is computationally expensive, because surviving, spawned, and born particles must be distributed in each frame to model the states of the speakers and to jointly estimate the variable number of speakers and their states. Most of this cost comes from the born particles, which need to be propagated over the entire image in every frame to detect the presence of new speakers in the view of the visual tracker. In this paper, we propose to use audio data to improve the visual SMC-PHD (V-SMC-PHD) filter: the direction of arrival angles of the audio sources determine when to propagate the born particles and where to reallocate the surviving and spawned particles. The tracking accuracy of the resulting audio-visual SMC-PHD (AV-SMC-PHD) algorithm is further improved by a modified mean-shift algorithm that iteratively searches and climbs density gradients to find the peak of the probability distribution, and the extra computational complexity introduced by mean-shift is controlled with a sparse sampling technique. These improved algorithms, named AVMS-SMC-PHD and sparse-AVMS-SMC-PHD, respectively, are compared systematically with AV-SMC-PHD and V-SMC-PHD on the AV16.3, AMI, and CLEAR datasets.
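To make the mean-shift step concrete, the following is a minimal sketch of a generic Gaussian-kernel mean-shift applied to weighted 2-D particles. It is not the modified mean-shift of the paper. The function names, the `bandwidth` parameter, and the `stride` parameter are assumptions introduced here for illustration; `stride` is only a crude stand-in for the idea of shifting a sparse subset of particles, and the paper's actual sparse sampling technique may differ.

```python
import math


def mean_shift_step(point, particles, weights, bandwidth):
    """Return the kernel-weighted mean of all particles, as seen from `point`."""
    num_x = num_y = denom = 0.0
    for (px, py), w in zip(particles, weights):
        d2 = (px - point[0]) ** 2 + (py - point[1]) ** 2
        # Gaussian kernel scaled by the particle weight (illustrative choice).
        k = w * math.exp(-d2 / (2.0 * bandwidth ** 2))
        num_x += k * px
        num_y += k * py
        denom += k
    if denom == 0.0:
        # No particle has any influence here; leave the point where it is.
        return point
    return (num_x / denom, num_y / denom)


def mean_shift(particles, weights, bandwidth=1.0, max_iter=50, tol=1e-6, stride=1):
    """Shift every `stride`-th particle uphill towards a local density peak.

    Particles that are skipped keep their original position. `stride` is a
    hypothetical simplification of sparse sampling, not the paper's method.
    """
    shifted = list(particles)
    for i in range(0, len(particles), stride):
        p = particles[i]
        for _ in range(max_iter):
            q = mean_shift_step(p, particles, weights, bandwidth)
            moved = math.hypot(q[0] - p[0], q[1] - p[1])
            p = q
            if moved < tol:
                break
        shifted[i] = p
    return shifted


# Two small particle clusters standing in for two speakers in the image plane.
demo_particles = [(-0.5, 0.0), (0.5, 0.0), (0.0, 0.5), (0.0, -0.5),
                  (9.5, 10.0), (10.5, 10.0), (10.0, 10.5), (10.0, 9.5)]
demo_weights = [1.0 / len(demo_particles)] * len(demo_particles)
demo_shifted = mean_shift(demo_particles, demo_weights, bandwidth=1.0)
```

Each shifted particle moves towards the peak of the weighted kernel density estimate nearest to it, so particles in the same cluster end up concentrated around a single mode. Setting `stride` above 1 reduces the number of mean-shift runs roughly in proportion, at the price of leaving the skipped particles unrefined.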
