Audio Visual Language Maps for Robot Navigation

While interacting in the world is a multi-sensory experience, many robots continue to rely predominantly on visual perception to map and navigate in their environments. In this work, we propose Audio-Visual-Language Maps (AVLMaps), a unified 3D spatial map representation for storing cross-modal information from audio, visual, and language cues. AVLMaps integrate the open-vocabulary capabilities of multimodal foundation models pre-trained on Internet-scale data by fusing their features into a centralized 3D voxel grid. In the context of navigation, we show that AVLMaps enable robot systems to index goals in the map based on multimodal queries, e.g., textual descriptions, images, or audio snippets of landmarks. In particular, the addition of audio information enables robots to more reliably disambiguate goal locations. Extensive experiments in simulation show that AVLMaps enable zero-shot goal navigation from multimodal prompts and provide 50% better recall in ambiguous scenarios. These capabilities extend to mobile robots in the real world, which navigate to landmarks referring to visual, audio, and spatial concepts. Videos and code are available at: https://avlmaps.github.io.
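
To make the core idea concrete, below is a minimal sketch (not the authors' implementation) of the kind of data structure the abstract describes: a 3D voxel grid whose cells store fused multimodal features, queried by embedding a prompt (text, image, or audio) into the same feature space and ranking voxels by cosine similarity. The `embed_query` function and the fusion rule are hypothetical stand-ins for pre-trained encoders such as CLIP or AudioCLIP and for the paper's actual feature-fusion scheme.

```python
# Sketch of an AVLMaps-style voxel map: each voxel holds a feature vector in a
# shared multimodal embedding space; a query in any modality is embedded into
# that space and the best-matching voxel is returned as a navigation goal.
import numpy as np

class AVLMapSketch:
    def __init__(self, grid_shape=(32, 32, 8), feat_dim=512):
        # One feature vector per voxel. In the paper these come from fusing
        # foundation-model features; here they are random placeholders.
        self.grid_shape = grid_shape
        self.features = np.random.randn(int(np.prod(grid_shape)), feat_dim)
        self.features /= np.linalg.norm(self.features, axis=1, keepdims=True)

    def fuse(self, voxel_index, feature):
        # Toy fusion rule (assumption): running average of observed features,
        # renormalized so cosine similarity remains meaningful.
        f = feature / np.linalg.norm(feature)
        self.features[voxel_index] = 0.5 * (self.features[voxel_index] + f)
        self.features[voxel_index] /= np.linalg.norm(self.features[voxel_index])

    def localize(self, query_feature):
        # Rank all voxels by cosine similarity to the query embedding and
        # return the 3D index of the best match as a candidate goal.
        q = query_feature / np.linalg.norm(query_feature)
        scores = self.features @ q
        return np.unravel_index(int(np.argmax(scores)), self.grid_shape)

def embed_query(prompt, feat_dim=512):
    # Hypothetical encoder: in practice this would be a pre-trained multimodal
    # model mapping text, images, or audio snippets into the shared feature
    # space used when building the map. Here: a deterministic random vector.
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.standard_normal(feat_dim)

avl_map = AVLMapSketch()
goal_voxel = avl_map.localize(embed_query("the sound of glass breaking"))
print("candidate goal voxel:", goal_voxel)
```

Because all modalities land in one embedding space, the same `localize` call serves textual, visual, and audio goal queries, which is what lets a single map disambiguate goals using whichever cue is most informative.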
