Catch Me if You Hear Me: Audio-Visual Navigation in Complex Unmapped Environments With Moving Sounds

Audio-visual navigation combines sight and hearing to navigate to a sound-emitting source in an unmapped environment. While recent approaches have demonstrated the benefits of audio input for detecting and finding the goal, they focus on clean, static sound sources and struggle to generalize to unheard sounds. In this work, we propose a novel dynamic audio-visual navigation benchmark that requires catching a moving sound source in an environment with noisy and distracting sounds, posing a range of new challenges. We introduce a reinforcement learning approach that learns a robust navigation policy for these complex settings. To achieve this, we propose an architecture that fuses audio-visual information in the spatial feature space to learn correlations of geometric information inherent in both local maps and audio signals. We demonstrate that our approach consistently outperforms the current state of the art by a large margin across all tasks (moving sounds, unheard sounds, and noisy environments) on two challenging datasets of 3D-scanned real-world environments, Matterport3D and Replica. The benchmark is available at http://dav-nav.cs.uni-freiburg.de.
