Semantic Object Prediction and Spatial Sound Super-Resolution with Binaural Sounds

Humans can robustly recognize and localize objects by integrating visual and auditory cues. While machines can already do this with images, far less work has explored doing it with sound. This work develops an approach for dense semantic labelling of sound-making objects, purely based on binaural sounds. We propose a novel sensor setup and record a new audio-visual dataset of street scenes with eight professional binaural microphones and a 360° camera. The co-existence of visual and audio cues is leveraged for supervision transfer. In particular, we employ a cross-modal distillation framework consisting of a vision `teacher' method and a sound `student' method: the student is trained to produce the same results as the teacher. This way, the auditory system can be trained without human annotations. We also propose two auxiliary tasks: a) a novel task of Spatial Sound Super-Resolution, which increases the spatial resolution of sounds, and b) dense depth prediction of the scene. We then formulate the three tasks as one end-to-end trainable multi-task network aiming to boost the overall performance. Experimental results on the dataset show that 1) our method achieves promising results for semantic prediction and the two auxiliary tasks; 2) the three tasks are mutually beneficial, with joint training achieving the best performance; and 3) both the number and the orientations of the microphones are important. The data and code will be released to facilitate research in this new direction.
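
To make the cross-modal distillation and multi-task training concrete, below is a minimal PyTorch-style sketch. All names in it (SoundStudentNet, vision_teacher, depth_teacher, loader, the loss weights w_depth and w_s3r) are hypothetical placeholders introduced for illustration, not the paper's released code: it only illustrates the idea of training a sound student against pseudo-labels from frozen vision teachers while jointly solving the two auxiliary tasks.

    import torch
    import torch.nn.functional as F

    # Hypothetical sound "student": maps binaural spectrograms to per-pixel
    # semantic logits, a dense depth map, and the held-out microphone
    # channels (spatial sound super-resolution).
    student = SoundStudentNet(num_classes=NUM_CLASSES)
    vision_teacher.eval()   # frozen semantic-segmentation teacher (vision)
    depth_teacher.eval()    # frozen monocular-depth teacher (vision)

    optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)

    for spec, image, held_out_channels in loader:
        # spec: spectrograms of the input binaural channels
        # held_out_channels: targets for spatial sound super-resolution
        with torch.no_grad():
            # Pseudo-labels from the vision teachers -- no human annotations.
            pseudo_labels = vision_teacher(image).argmax(dim=1)
            pseudo_depth = depth_teacher(image)

        sem_logits, depth_pred, s3r_pred = student(spec)

        loss_sem = F.cross_entropy(sem_logits, pseudo_labels)    # distillation
        loss_depth = F.l1_loss(depth_pred, pseudo_depth)         # auxiliary depth
        loss_s3r = F.l1_loss(s3r_pred, held_out_channels)        # auxiliary S3R

        # w_depth and w_s3r are assumed task weights, not values from the paper.
        loss = loss_sem + w_depth * loss_depth + w_s3r * loss_s3r
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

The single joint objective is what lets the three tasks share one encoder and, as reported in the abstract, benefit each other when trained together.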
