SoundSpaces: Audio-Visual Navigation in 3D Environments

Moving around in the world is naturally a multisensory experience, but today's embodied agents are deaf: restricted solely to their visual perception of the environment. We introduce audio-visual navigation for complex, acoustically and visually realistic 3D environments. By both seeing and hearing, the agent must learn to navigate to a sounding object. We propose a multi-modal deep reinforcement learning approach to train navigation policies end-to-end from a stream of egocentric audio-visual observations, allowing the agent to (1) discover elements of the geometry of the physical space indicated by the reverberating audio and (2) detect and follow sound-emitting targets. We further introduce SoundSpaces: a first-of-its-kind dataset of audio renderings based on geometrical acoustic simulations for two sets of publicly available 3D environments (Matterport3D and Replica), and we instrument Habitat to support the new sensor, making it possible to insert arbitrary sound sources in an array of real-world scanned environments. Our results show that audio greatly benefits embodied visual navigation in 3D spaces, and our work lays the groundwork for new research in embodied AI with audio-visual perception.
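The ability to "insert arbitrary sound sources" rests on a standard property of geometrical acoustics: once a room impulse response (RIR) has been simulated for a given source/listener placement, the audio heard at that placement is the dry (anechoic) source waveform convolved with that RIR. The sketch below illustrates this rendering step in Python; the file names, the two-channel binaural RIR layout, and the peak-normalization step are illustrative assumptions, not the dataset's actual API.

```python
# Minimal sketch of RIR-based binaural rendering, assuming a mono dry
# source WAV and a precomputed two-channel (left/right) binaural RIR WAV
# for one source/listener placement. File names are hypothetical.
import numpy as np
from scipy.io import wavfile
from scipy.signal import fftconvolve

def render_binaural(source_wav: str, rir_wav: str, out_wav: str) -> None:
    sr_src, source = wavfile.read(source_wav)   # mono source, shape (T,)
    sr_rir, rir = wavfile.read(rir_wav)         # binaural RIR, shape (T_rir, 2)
    assert sr_src == sr_rir, "source and RIR must share a sample rate"
    source = source.astype(np.float32)
    rir = rir.astype(np.float32)
    # Convolve the dry source with the left/right impulse responses
    # independently; this imposes the room's reverberation and the
    # direction-dependent cues on the source sound.
    left = fftconvolve(source, rir[:, 0])
    right = fftconvolve(source, rir[:, 1])
    binaural = np.stack([left, right], axis=1)
    binaural /= np.abs(binaural).max() + 1e-8   # peak-normalize, avoid clipping
    wavfile.write(out_wav, sr_src, binaural)
```

The end-to-end multi-modal policy described above can likewise be pictured as two convolutional encoders (one per modality) whose features are fused and passed through a recurrent actor-critic. The PyTorch sketch below is a hedged illustration of that kind of architecture under assumed inputs (an egocentric RGB frame and a binaural audio spectrogram); the class name, layer sizes, and action count are assumptions rather than the paper's exact network.

```python
import torch
import torch.nn as nn

class AudioVisualPolicy(nn.Module):
    """Illustrative audio-visual navigation policy: separate CNN encoders
    for the egocentric RGB frame and the 2-channel audio spectrogram,
    fused and fed to a GRU with actor/critic heads. Sizes are notional."""

    def __init__(self, num_actions: int = 4, hidden: int = 512):
        super().__init__()

        def encoder(in_ch: int) -> nn.Sequential:
            return nn.Sequential(
                nn.Conv2d(in_ch, 32, 8, stride=4), nn.ReLU(),
                nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
                nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
                nn.Flatten(), nn.LazyLinear(hidden // 2), nn.ReLU(),
            )

        self.rgb_enc = encoder(3)    # egocentric RGB frame
        self.audio_enc = encoder(2)  # binaural (left/right) spectrogram
        self.gru = nn.GRUCell(hidden, hidden)        # memory across steps
        self.actor = nn.Linear(hidden, num_actions)  # action logits
        self.critic = nn.Linear(hidden, 1)           # state-value estimate

    def forward(self, rgb, spec, h):
        fused = torch.cat([self.rgb_enc(rgb), self.audio_enc(spec)], dim=-1)
        h = self.gru(fused, h)
        return self.actor(h), self.critic(h), h
```

In a training loop, the action logits would parameterize a categorical action distribution for an on-policy RL update (e.g., PPO-style), with the GRU hidden state carried across the steps of each episode so the agent can integrate reverberation cues over time.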
