Watch-And-Help: A Challenge for Social Perception and Human-AI Collaboration

In this paper, we introduce Watch-And-Help (WAH), a challenge for testing social intelligence in agents. In WAH, an AI agent needs to help a human-like agent perform a complex household task efficiently. To succeed, the AI agent needs to i) understand the underlying goal of the task by watching a single demonstration of the human-like agent performing the same task (social perception), and ii) coordinate with the human-like agent to solve the task in an unseen environment as fast as possible (human-AI collaboration). For this challenge, we build VirtualHome-Social, a multi-agent household environment, and provide a benchmark that includes both planning-based and learning-based baselines. We evaluate the performance of AI agents with the human-like agent as well as with real humans, using objective metrics and subjective user ratings. Experimental results demonstrate that the proposed challenge and virtual environment enable a systematic evaluation of key aspects of machine social intelligence at scale.
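The abstract describes a two-stage protocol: a watch stage, in which the helper agent infers the human-like agent's goal from a single demonstration, and a help stage, in which both agents act together in a new environment. The sketch below illustrates that evaluation loop; the `VirtualHomeSocial`-style environment methods and the `infer_goal`/`act` agent interface are hypothetical placeholders, not the actual VirtualHome-Social API.

```python
# Minimal sketch of the Watch-And-Help evaluation loop.
# All environment and agent interfaces here are illustrative assumptions,
# not the real VirtualHome-Social API.

def evaluate_episode(env, helper, human_like_agent, max_steps=250):
    # Watch stage: the helper observes a single demonstration of the
    # human-like agent performing the task alone and infers its goal.
    demonstration = env.run_demonstration(human_like_agent)
    inferred_goal = helper.infer_goal(demonstration)

    # Help stage: both agents act in an unseen environment instance;
    # the episode ends once the (hidden) goal predicates are satisfied.
    obs_human, obs_helper = env.reset()
    done = False
    for t in range(max_steps):
        action_human = human_like_agent.act(obs_human)
        action_helper = helper.act(obs_helper, inferred_goal)
        (obs_human, obs_helper), done = env.step(action_human, action_helper)
        if done:
            break

    # Efficiency is judged by how many steps the pair needed relative to
    # the human-like agent acting alone (speedup), alongside success rate.
    return {"success": done, "steps": t + 1}
```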
