Multi-Modal Localization and Enhancement of Multiple Sound Sources from a Micro Aerial Vehicle

The ego-noise generated by the motors and propellers of a micro aerial vehicle (MAV) masks environmental sounds and considerably degrades the quality of on-board sound recordings. Sound enhancement approaches generally require knowledge of the directions of arrival of the target sound sources, which are difficult to estimate because of the low signal-to-noise ratio (SNR) caused by the ego-noise and the interference between multiple sources. To address this problem, we propose a multi-modal analysis approach that jointly exploits audio and video to enhance the sounds of multiple targets captured from an MAV equipped with a microphone array and a video camera. We first address audio-visual calibration via camera resectioning, audio-visual temporal alignment and geometrical alignment, so that features from the independently generated audio and video streams can be used jointly. The spatial information from the video is used to assist sound enhancement by tracking multiple potential sound sources with a particle filter. We then infer the directions of arrival of the target sources from the video tracking results and extract the sound from the desired direction with a time-frequency spatial filter, which suppresses the ego-noise by exploiting its time-frequency sparsity. Experimental results with real outdoor data verify the robustness of the proposed multi-modal approach for multiple speakers in extremely low-SNR scenarios.
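
The step of inferring a direction of arrival from a video tracking result can be illustrated with a minimal sketch. It is not the calibration procedure of the paper: it assumes an ideal pinhole camera without lens distortion, and it assumes that geometrical alignment has already made the camera and microphone-array frames coincide. The function name `pixel_to_doa` and the intrinsic parameters `fx`, `fy`, `cx`, `cy` are hypothetical names chosen for illustration.

```python
import math


def pixel_to_doa(u, v, fx, fy, cx, cy):
    """Map an image pixel (u, v) to (azimuth, elevation) in degrees.

    Assumptions (not taken from the paper): ideal pinhole model, no lens
    distortion, camera axes x right, y down, z forward, and a microphone
    array whose frame coincides with the camera frame.
    """
    # Normalized image coordinates: the viewing ray is (x, y, 1).
    x = (u - cx) / fx
    y = (v - cy) / fy
    # Azimuth is measured in the horizontal plane, positive to the right.
    azimuth = math.degrees(math.atan2(x, 1.0))
    # Elevation is measured from the horizontal plane, positive upwards.
    elevation = math.degrees(math.atan2(-y, math.hypot(x, 1.0)))
    return azimuth, elevation


if __name__ == "__main__":
    # Hypothetical intrinsics for a 1280x720 image.
    az, el = pixel_to_doa(u=1440.0, v=360.0, fx=800.0, fy=800.0, cx=640.0, cy=360.0)
    print(f"azimuth={az:.1f} deg, elevation={el:.1f} deg")
```

In such a setup the tracked target position in each frame (for example the centre of a bounding box) would be passed through this mapping, and the resulting angles would steer the spatial filter. A real system would also need to undo lens distortion and apply the rotation and translation between the camera and the array.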
