Toward Grounded Social Reasoning

Consider a robot tasked with tidying a desk on which sits a meticulously constructed Lego sports car. A human would recognize that it is not socially appropriate to disassemble the sports car and put it away as part of the "tidying". How can a robot reach that conclusion? Although large language models (LLMs) have recently been used to enable social reasoning, grounding this reasoning in the real world has been challenging. To reason in the real world, robots must go beyond passively querying LLMs and *actively gather the information from the environment* that is required to make the right decision. For instance, after detecting that there is an occluded car, the robot may need to actively perceive the car to know whether it is an advanced model car made out of Legos or a toy car built by a toddler. We propose an approach that leverages an LLM and a vision language model (VLM) to help a robot actively perceive its environment and perform grounded social reasoning. To evaluate our framework at scale, we release the MessySurfaces dataset, which contains images of 70 real-world surfaces that need to be cleaned. We additionally illustrate our approach with a robot on two carefully designed surfaces. We find an average 12.9% improvement on the MessySurfaces benchmark and an average 15% improvement in the robot experiments over baselines that do not use active perception. The dataset, code, and videos of our approach can be found at https://minaek.github.io/groundedsocialreasoning.
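To make the active-perception loop concrete, the following is a minimal Python sketch of the control flow described above. It is an illustration under stated assumptions, not the released implementation: `query_llm`, `query_vlm`, and `take_closeup` are hypothetical stand-ins for an LLM endpoint, a VLM endpoint, and a robot camera call, and the prompts are placeholders rather than the paper's actual prompts.

```python
"""Minimal sketch of an active-perception loop for grounded social reasoning.

NOT the authors' implementation: the three callables below are hypothetical
stand-ins. The sketch only illustrates the control flow: the LLM proposes
follow-up questions about each object, the robot captures a close-up image,
the VLM answers the questions from that image, and the LLM then chooses a
socially appropriate action grounded in those answers.
"""

from typing import Callable, Dict, List


def active_social_reasoning(
    objects: List[str],
    take_closeup: Callable[[str], bytes],    # object name -> close-up image
    query_llm: Callable[[str], str],         # text prompt -> text response
    query_vlm: Callable[[bytes, str], str],  # image + question -> text answer
    num_questions: int = 3,
) -> Dict[str, str]:
    """Return a tidying action for each object, grounded in active perception."""
    actions: Dict[str, str] = {}
    for obj in objects:
        # 1. Ask the LLM what it needs to know before acting on this object.
        questions = query_llm(
            f"A robot is tidying a desk. List {num_questions} short questions "
            f"whose answers determine how to tidy the object: '{obj}'."
        ).splitlines()[:num_questions]

        # 2. Actively perceive: take a close-up and have the VLM answer
        #    each question from the new image.
        image = take_closeup(obj)
        answers = [
            f"Q: {q} A: {query_vlm(image, q)}" for q in questions if q.strip()
        ]

        # 3. Ground the final decision in the gathered evidence.
        actions[obj] = query_llm(
            f"Given these observations about the object '{obj}':\n"
            + "\n".join(answers)
            + "\nWhat is the most socially appropriate way to tidy it? "
            "Answer with one short action (e.g., 'leave in place')."
        )
    return actions
```

In the Lego example, step 2 is what lets the system distinguish "a meticulously built model to leave intact" from "loose bricks to put away" before committing to an action, rather than deciding from the initial, possibly occluded view alone.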
