Paper information - Deep Metric Learning-Assisted 3D Audio-Visual Speaker Tracking via Two-Layer Particle Filter

For speaker tracking, integrating multimodal information from audio and video provides an effective and promising solution. The main remaining challenge is constructing a stable observation model. To this end, we propose a 3D audio-visual speaker tracker assisted by deep metric learning within a two-layer particle filter framework. First, an audio-guided motion model generates candidate samples in a hierarchical structure consisting of an audio layer and a visual layer. Then, a stable observation model is built on a purpose-designed Siamese network, which provides a similarity-based likelihood for calculating particle weights. The speaker position is estimated from an optimal particle set that integrates the decisions of the audio particles and the visual particles. Finally, a template update strategy based on a long short-term mechanism is adopted to prevent drift during tracking. Experimental results demonstrate that the proposed method outperforms single-modal trackers and the comparison methods. Efficient and robust tracking is achieved both in 3D space and on the image plane.
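The idea of turning a similarity score into particle weights can be illustrated with a minimal sketch. It is a generic single-layer bootstrap particle filter, not the authors' two-layer audio-visual tracker: the random-walk motion model, the exponential mapping in `similarity_likelihood`, and the `toy_similarity` function (a hypothetical stand-in for a learned Siamese similarity) are all assumptions made for illustration.

```python
import math
import random


def predict(particles, noise_std, rng):
    """Random-walk motion model (assumed): jitter each 3D particle with Gaussian noise."""
    return [tuple(c + rng.gauss(0.0, noise_std) for c in p) for p in particles]


def similarity_likelihood(similarity, sharpness=10.0):
    """Map a similarity score in [0, 1] to an unnormalized likelihood (assumed form)."""
    return math.exp(sharpness * (similarity - 1.0))


def normalize(weights):
    """Scale weights to sum to one; fall back to uniform if they all vanish."""
    total = sum(weights)
    if total <= 0.0:
        return [1.0 / len(weights)] * len(weights)
    return [w / total for w in weights]


def estimate(particles, weights):
    """Weighted mean of the particle positions."""
    dims = range(len(particles[0]))
    return tuple(sum(w * p[d] for p, w in zip(particles, weights)) for d in dims)


def systematic_resample(particles, weights, rng):
    """Draw len(particles) samples in proportion to the normalized weights."""
    n = len(particles)
    cumulative, total = [], 0.0
    for w in weights:
        total += w
        cumulative.append(total)
    start = rng.random() / n  # one random offset shared by all n strata
    resampled, i = [], 0
    for k in range(n):
        u = start + k / n
        while i < n - 1 and cumulative[i] <= u:
            i += 1
        resampled.append(particles[i])
    return resampled


def toy_similarity(position, target, scale=0.5):
    """Hypothetical stand-in for a learned similarity: 1 at the target, lower farther away."""
    dist2 = sum((a - b) ** 2 for a, b in zip(position, target))
    return math.exp(-dist2 / (2.0 * scale ** 2))


def track_toy_target(target=(1.0, 2.0, 0.5), n_particles=500, n_steps=10, seed=0):
    """Run a few predict / weight / estimate / resample cycles on a static toy target."""
    rng = random.Random(seed)
    particles = [tuple(rng.uniform(-3.0, 3.0) for _ in range(3)) for _ in range(n_particles)]
    position = None
    for _ in range(n_steps):
        particles = predict(particles, 0.1, rng)
        weights = normalize(
            [similarity_likelihood(toy_similarity(p, target)) for p in particles]
        )
        position = estimate(particles, weights)
        particles = systematic_resample(particles, weights, rng)
    return position


if __name__ == "__main__":
    print(track_toy_target())
```

In a real tracker, `toy_similarity` would be replaced by a score computed from the observation at each particle's location, and the `sharpness` parameter would need tuning against how discriminative that score is.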

Hong Liu | Yang Chen | Bing Yang | Runwei Ding | Yidi Li
