Multiple Speaker Tracking in Spatial Audio via PHD Filtering and Depth-Audio Fusion

In object-based spatial audio systems, the positions of the audio objects (e.g., speakers/talkers or voices) present in the sound scene are required as important metadata attributes for object acquisition and reproduction. Binaural microphones are often used as a physical device to mimic human hearing and to monitor and analyze the scene, including the localization and tracking of multiple speakers. The binaural audio tracker, however, is prone to errors caused by room reverberation and background noise. To address this limitation, we present a multimodal tracking method that fuses the binaural audio with depth information (from a depth sensor, e.g., Kinect). More specifically, the probability hypothesis density (PHD) filtering framework is first applied to the depth stream, and a novel clutter intensity model is proposed to improve the robustness of the PHD filter when an object is occluded by other objects or missed because of the limited field of view of the depth sensor. To compensate for misdetections in the depth stream, a novel gap-filling technique is presented that maps audio azimuths obtained from the binaural audio tracker to 3D positions, using speaker-dependent spatial constraints learned from the depth stream. With the proposed method, both the errors in the binaural tracker and the misdetections in the depth tracker can be significantly reduced. Real-room recordings are used to show the improved performance of the proposed method in removing outliers and reducing misdetections.
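
The idea of filling depth misdetections from audio azimuths can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not specify the form of the speaker-dependent spatial constraints, so the sketch assumes, purely for illustration, that each speaker's constraint is the mean horizontal range and mean height observed in the depth stream, that the depth sensor and binaural microphone share one coordinate origin, and that azimuth is measured from the forward (+z) axis towards +x. The names `SpeakerConstraint`, `learn_constraint`, `azimuth_to_position`, and `fill_gaps` are hypothetical.

```python
import math
from dataclasses import dataclass
from statistics import mean


@dataclass
class SpeakerConstraint:
    """Hypothetical per-speaker constraint learned from depth detections."""
    horizontal_range: float  # mean distance from the sensor in the x-z plane (m)
    height: float            # mean height along the y axis (m)


def learn_constraint(depth_positions):
    """Average horizontal range and height over observed (x, y, z) detections."""
    ranges = [math.hypot(x, z) for x, _, z in depth_positions]
    heights = [y for _, y, _ in depth_positions]
    return SpeakerConstraint(mean(ranges), mean(heights))


def azimuth_to_position(azimuth_rad, constraint):
    """Place the speaker on the learned range circle at the audio azimuth.

    Assumption: azimuth is measured from the +z (forward) axis towards +x.
    """
    x = constraint.horizontal_range * math.sin(azimuth_rad)
    z = constraint.horizontal_range * math.cos(azimuth_rad)
    return (x, constraint.height, z)


def fill_gaps(depth_track, audio_azimuths):
    """Replace missing depth detections (None) with audio-derived positions."""
    observed = [p for p in depth_track if p is not None]
    if not observed:
        # No depth evidence to learn a constraint from; leave the gaps as-is.
        return list(depth_track)
    constraint = learn_constraint(observed)
    return [
        p if p is not None else azimuth_to_position(az, constraint)
        for p, az in zip(depth_track, audio_azimuths)
    ]


# One speaker, three frames; the depth sensor misses the second frame.
depth_track = [(0.0, 1.5, 2.0), None, (0.0, 1.5, 2.0)]
audio_azimuths = [0.0, math.pi / 2, 0.0]
filled = fill_gaps(depth_track, audio_azimuths)
```

In this toy case the missing frame is placed 2 m from the sensor at the learned height of 1.5 m, in the direction given by the audio azimuth. A practical system would also need to handle the audio errors caused by reverberation and noise that the abstract mentions, which this sketch does not attempt.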
