VirtualHome: Simulating Household Activities Via Programs

In this paper, we are interested in modeling complex activities that occur in a typical household. We propose to use programs, i.e., sequences of atomic actions and interactions, as a high-level representation of complex tasks. Programs are interesting because they provide an unambiguous representation of a task and allow agents to execute them. However, no existing database provides this type of information. Towards this goal, we first crowd-source programs for a variety of activities that take place in people's homes, via a game-like interface used for teaching kids how to code. Using the collected dataset, we show how we can learn to extract programs directly from natural language descriptions or from videos. We then implement the most common atomic (inter)actions in the Unity3D game engine, and use our programs to "drive" an artificial agent that executes tasks in a simulated household environment. Our VirtualHome simulator allows us to create a large activity video dataset with rich ground truth, enabling training and testing of video understanding models. We further showcase examples of our agent performing tasks in VirtualHome based on language descriptions.
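To make the program representation concrete, below is a minimal Python sketch of the idea: an activity program as a sequence of atomic (inter)action steps, plus a toy executability check against a vocabulary of implemented actions. The Step class, the "watch TV" program, and the KNOWN_ACTIONS set are illustrative assumptions for this sketch, not the paper's actual program syntax or the simulator's real action set.

    from dataclasses import dataclass

    @dataclass
    class Step:
        """One atomic (inter)action applied to zero or more objects."""
        action: str              # e.g. "walk", "grab", "switch_on"
        objects: tuple = ()      # object(s) the action applies to

    # Hypothetical program for the activity "watch TV": an ordered
    # sequence of atomic steps, readable by both humans and agents.
    watch_tv = [
        Step("walk", ("living_room",)),
        Step("switch_on", ("tv",)),
        Step("sit", ("sofa",)),
        Step("watch", ("tv",)),
    ]

    # Toy executability check: every action in the program must be in
    # the simulator's vocabulary of implemented atomic (inter)actions.
    # This vocabulary is illustrative, not the real action set.
    KNOWN_ACTIONS = {"walk", "grab", "open", "close", "switch_on", "sit", "watch"}

    def is_executable(program):
        return all(step.action in KNOWN_ACTIONS for step in program)

    if __name__ == "__main__":
        print(is_executable(watch_tv))  # True

Because each step names an explicit action and its target objects, such a program is unambiguous in a way free-form text is not, which is what makes it suitable both as a learning target (from descriptions or videos) and as an input that drives the simulated agent.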
