Embodied Symbiotic Assistants that See, Act, Infer and Chat

We present Symbiote, an embodied home assistant that maps camera images to objects and rooms, builds geometric semantic maps, parses human instructions and conversations into user intents and their arguments, explores in a goal-directed way to find relevant objects when they are not yet present in the map, executes the inferred action plans using its navigation and manipulation policies, and/or asks questions to clarify intents and arguments. Our main contribution is a hybrid approach to the semantic parsing of user instructions and their mapping to suitable action routines. We propose a text-to-text neural encoder-decoder language parsing model that maps user instructions to sequences of simplified utterances. The generated utterances are then mapped to parameterized action primitives for execution by a rule-based parser. Our neural parser benefits from large-scale text-to-text unsupervised language pre-training, and our rule-based parser effectively covers the domain of simplified single-step instructions that our neural model generates. Training the neural parser to map language utterances directly to parameterized action programs would not work, as the output space would lie far outside the text domain the model was pre-trained on. We present ablations and evaluations of the different modules of our agent. We discuss our failure modes, which mostly stem from inaccurate referential grounding of object instances, instruction-parsing errors, and perception failures. We outline current and future experiments and research directions in open-vocabulary spatio-temporal 2D and 3D perception, memory-augmented vision-language parsing networks that support continual learning without forgetting, and fast, few-shot learning during deployment and interaction with human users. We also discuss our present conversational strategies and how we plan to improve them.
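To make the hybrid parsing pipeline concrete, here is a minimal sketch of the two stages described above: a pre-trained text-to-text encoder-decoder (a T5-style model, as a stand-in for the fine-tuned parser) rewrites an instruction into simplified single-step utterances, and a rule-based parser maps each utterance to a parameterized action primitive. The checkpoint, the "parse:" prompt prefix, the ";"-separated output convention, the primitive names, and the regex patterns are all illustrative assumptions, not the system's actual implementation.

```python
# Hedged sketch of the two-stage instruction parser. Assumes a T5-style
# checkpoint fine-tuned to rewrite an instruction into "; "-separated
# simplified utterances; "t5-base" below is an untuned stand-in.
import re
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

def simplify(instruction: str) -> list[str]:
    """Neural stage: instruction -> sequence of simplified utterances."""
    inputs = tokenizer("parse: " + instruction, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=64)
    text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return [u.strip() for u in text.split(";") if u.strip()]

# Rule-based stage: each simplified single-step utterance matches one
# parameterized action primitive. Patterns and primitive names are
# hypothetical examples of what such a grammar could look like.
PRIMITIVE_PATTERNS = [
    (re.compile(r"go to (?:the )?(?P<target>\w+)"), "Navigate"),
    (re.compile(r"pick up (?:the )?(?P<target>\w+)"), "PickUp"),
    (re.compile(
        r"place (?:the )?(?P<target>\w+) on (?:the )?(?P<receptacle>\w+)"),
     "Place"),
]

def to_primitives(utterances: list[str]) -> list[tuple[str, dict]]:
    """Map each simplified utterance to a (primitive, arguments) pair."""
    program = []
    for utt in utterances:
        for pattern, primitive in PRIMITIVE_PATTERNS:
            match = pattern.fullmatch(utt.lower())
            if match:
                program.append((primitive, match.groupdict()))
                break
        else:
            # No primitive matched: fall back to asking a clarifying
            # question, mirroring the agent's dialogue behavior.
            program.append(("Clarify", {"utterance": utt}))
    return program
```

Under these assumptions, "bring the mug to the table" might be simplified to ["go to the mug", "pick up the mug", "place the mug on the table"], which the rule-based stage would compile into a Navigate, PickUp, Place program; keeping the neural output in plain natural language is what lets the model exploit its text-to-text pre-training.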
