The Audio-Visual BatVision Dataset for Research on Sight and Sound

Vision research has shown remarkable success in understanding our world, propelled by large datasets of images and videos. Sensor data from radar, LiDAR and cameras has supported research in robotics and autonomous driving for at least a decade. Yet while visual sensors may fail under some conditions, sound has recently shown potential to complement them. Room impulse responses (RIRs) simulated in 3D apartment models have become a benchmark dataset for the community, fostering a range of audio-visual research. In simulation, depth can be predicted from sound by training a neural network to learn bat-like perception. Concurrently, the same was achieved in the real world using RGB-D images and the echoes of emitted chirps. Biomimicking bat perception is an exciting new direction but requires dedicated datasets to explore its potential. We therefore collected the BatVision dataset to provide the community with large-scale echoes recorded in complex real-world scenes. We equipped a robot with a speaker to emit chirps and a binaural microphone to record their echoes. Synchronized RGB-D images from the same perspective provide visual labels of the traversed spaces. We sampled locations ranging from modern US office spaces to historic French university grounds, indoors and outdoors, with large architectural variety. This dataset enables research on robot echolocation, general audio-visual tasks and sound phenomena unavailable in simulated data. We show promising results for audio-only depth prediction and demonstrate that state-of-the-art methods developed for simulated data can also succeed on our dataset. The data can be downloaded at https://forms.gle/W6xtshMgoXGZDwsE7
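To make the capture setup concrete, the following is a minimal Python sketch of such a chirp-emission and binaural echo-recording loop. It assumes a generic audio stack (numpy, scipy, sounddevice); the sweep range, durations and sample rate are illustrative assumptions, not the dataset's published recording specification.

    import numpy as np
    from scipy.signal import chirp
    import sounddevice as sd

    FS = 44_100               # sample rate in Hz (assumed)
    SWEEP_LEN = 0.003         # chirp duration in seconds (assumed)
    F0, F1 = 20.0, 20_000.0   # sweep start/end frequencies in Hz (assumed)
    RECORD_LEN = 0.1          # echo recording window in seconds (assumed)

    def make_chirp():
        """Logarithmic sine sweep with a Hann window to avoid clicks."""
        t = np.linspace(0.0, SWEEP_LEN, int(FS * SWEEP_LEN), endpoint=False)
        sweep = chirp(t, f0=F0, t1=SWEEP_LEN, f1=F1, method="logarithmic")
        return (sweep * np.hanning(len(sweep))).astype(np.float32)

    def emit_and_record():
        """Play the chirp and simultaneously record two (binaural) channels."""
        sweep = make_chirp()
        # Pad with silence so the recording window covers returning echoes.
        out = np.zeros(int(FS * RECORD_LEN), dtype=np.float32)
        out[:len(sweep)] = sweep
        echoes = sd.playrec(out, samplerate=FS, channels=2)  # shape (N, 2)
        sd.wait()  # block until playback and recording finish
        return echoes

From recordings of this kind, a binaural room impulse response can in principle be estimated by deconvolving the recorded echoes with the emitted sweep, following the standard swept-sine measurement technique.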
