Exploiting the Complementarity of Audio and Visual Data in Multi-speaker Tracking

Multi-speaker tracking is a central problem in human-robot interaction. In this context, exploiting auditory and visual information is both gratifying and challenging. Gratifying, because the complementary nature of the two modalities makes the tracker more robust to noise and outliers than unimodal approaches. Challenging, because properly fusing auditory and visual information for multi-speaker tracking remains far from a solved problem. In this paper we propose a probabilistic generative model that tracks multiple speakers by jointly exploiting auditory and visual features in their own representation spaces. Importantly, the method is robust to missing data and can therefore keep tracking even when observations from one of the modalities are absent. Quantitative and qualitative results on the AVDIAR dataset are reported.
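
To make the fusion idea concrete, the sketch below illustrates one simple way of combining per-modality observations while tolerating missing data. It is not the paper's generative model: it assumes a single speaker, a linear-Gaussian (constant-velocity) state-space model, hand-picked noise covariances, and audio localizations already mapped to image coordinates; a missing modality simply skips its update step.

```python
# Minimal sketch (not the paper's method): a Kalman-style tracker for one
# speaker's 2D image position that fuses visual and audio observations,
# each with its own noise model, and skips a modality when it is missing.
# Dynamics, noise levels, and the audio-to-image mapping are assumptions.

import numpy as np

# State: [x, y, vx, vy]; constant-velocity dynamics (assumed).
F = np.array([[1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)
Q = 0.01 * np.eye(4)            # process noise (assumed)

# Both modalities observe position only, but with different uncertainty:
# visual detections are assumed precise, audio localization much noisier.
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)
R_visual = 4.0 * np.eye(2)      # assumed pixel variance of the face detector
R_audio = 100.0 * np.eye(2)     # assumed variance of audio-derived position

def predict(mu, P):
    """Propagate the Gaussian state belief through the dynamics."""
    return F @ mu, F @ P @ F.T + Q

def update(mu, P, z, R):
    """Standard Kalman update with one modality's observation z."""
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    mu = mu + K @ (z - H @ mu)
    P = (np.eye(4) - K @ H) @ P
    return mu, P

def track(observations):
    """observations: list of dicts with optional 'visual' and 'audio' entries.
    A missing or None entry means that modality produced no observation."""
    mu, P = np.zeros(4), 10.0 * np.eye(4)
    trajectory = []
    for obs in observations:
        mu, P = predict(mu, P)
        if obs.get("visual") is not None:
            mu, P = update(mu, P, np.asarray(obs["visual"], float), R_visual)
        if obs.get("audio") is not None:
            mu, P = update(mu, P, np.asarray(obs["audio"], float), R_audio)
        trajectory.append(mu[:2].copy())
    return trajectory

if __name__ == "__main__":
    frames = [
        {"visual": [10.0, 10.0], "audio": [12.0, 9.0]},
        {"visual": None,         "audio": [13.0, 11.0]},  # visual detector failed
        {"visual": [14.0, 12.0], "audio": None},          # speaker is silent
    ]
    for t, p in enumerate(track(frames)):
        print(f"frame {t}: estimated position {p}")
```

The only design point carried over from the abstract is the qualitative one: each modality is handled in its own observation space with its own uncertainty, so the tracker degrades gracefully rather than failing when one stream drops out.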
