Deep Metric Learning-Assisted 3D Audio-Visual Speaker Tracking via Two-Layer Particle Filter

Integrating audio and video information is an effective and promising approach to speaker tracking. The main remaining challenge is constructing a stable observation model. To this end, we propose a 3D audio-visual speaker tracker that uses deep metric learning within a two-layer particle filter framework. First, an audio-guided motion model generates candidate samples in a hierarchical structure consisting of an audio layer and a visual layer. Then, a stable observation model built on a purpose-designed Siamese network provides a similarity-based likelihood, which is used to calculate the particle weights. The speaker position is estimated from an optimal particle set that integrates the decisions of the audio particles and the visual particles. Finally, a template update strategy based on a long short-term mechanism is adopted to prevent drift during tracking. Experimental results demonstrate that the proposed method outperforms single-modal trackers and the other methods compared. Efficient and robust tracking is achieved both in 3D space and on the image plane.
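
The weighting and estimation steps described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each particle already has an embedding (produced in the paper by the Siamese network, replaced here by plain vectors), uses cosine similarity as a stand-in for the learned metric, and takes a weighted mean as the position estimate. The function names and the `sharpness` parameter are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_weights(template, candidates, sharpness=10.0):
    """Turn template-to-candidate similarities into normalized particle weights.

    `sharpness` is an assumed parameter controlling how strongly the
    likelihood favors the most similar candidates.
    """
    scores = [math.exp(sharpness * cosine_similarity(template, c))
              for c in candidates]
    total = sum(scores)
    return [s / total for s in scores]


def estimate_position(particles, weights):
    """Weighted mean of the particle positions (one value per dimension)."""
    dims = len(particles[0])
    return tuple(sum(w * p[d] for p, w in zip(particles, weights))
                 for d in range(dims))


# Toy usage: three 3D particles with hand-made 2D embeddings.
particles = [(1.0, 2.0, 1.5), (1.2, 2.1, 1.4), (3.0, 0.5, 1.0)]
embeddings = [(1.0, 0.1), (1.0, 0.3), (0.2, 1.0)]
template = (1.0, 0.0)

weights = similarity_weights(template, embeddings)
position = estimate_position(particles, weights)
print(weights, position)
```

In this sketch the particle whose embedding is closest to the template receives the largest weight, so the estimate is pulled toward it. The paper's two-layer structure, audio-guided motion model, and template update are not modeled here.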
