The ThreeDWorld Transport Challenge: A Visually Guided Task-and-Motion Planning Benchmark Towards Physically Realistic Embodied AI

We introduce a visually guided and physics-driven task-and-motion planning benchmark, which we call the ThreeDWorld Transport Challenge. In this challenge, an embodied agent equipped with two 9-DOF articulated arms is spawned randomly in a simulated physical home environment. The agent is required to find a small set of objects scattered around the house, pick them up, and transport them to a desired final location. We also position containers around the house that can be used as tools to assist with transporting objects efficiently. To complete the task, an embodied agent must plan a sequence of actions to change the state of a large number of objects in the face of realistic physical constraints. We build this benchmark challenge using the ThreeDWorld simulation: a virtual 3D environment where all objects respond to physics, and where the agent can be controlled using a fully physics-driven navigation and interaction API. We evaluate several existing agents on this benchmark. Experimental results suggest that: 1) a pure RL model struggles on this challenge; and 2) hierarchical planning-based agents can transport some objects but are still far from solving this task. We anticipate that this benchmark will empower researchers to develop more intelligent physics-driven robots for the physical world.
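To make the task structure concrete, the sketch below illustrates the kind of high-level plan a hierarchical agent might issue: fetch a container, collect the target objects into it, and deliver everything to the goal location. The environment class and action names used here (TransportChallengeEnv, "move_to", "pick_up", "put_in", "drop_all") are hypothetical placeholders for illustration and are not the actual ThreeDWorld API; in the real benchmark every command is resolved by the physics engine and can fail.

```python
# A minimal, hypothetical sketch: the environment class and action names below
# (TransportChallengeEnv, "move_to", "pick_up", "put_in", "drop_all") are
# illustrative placeholders, not the actual ThreeDWorld API.
from typing import List, Optional, Tuple


class TransportChallengeEnv:
    """Stub environment that only records the commanded action sequence."""

    def __init__(self) -> None:
        self.log: List[Tuple[str, Optional[int]]] = []

    def step(self, action: str, target: Optional[int] = None) -> None:
        # In the real benchmark each command is resolved by the physics engine,
        # so navigation and grasping can fail; this stub only logs the command.
        self.log.append((action, target))


def transport_with_container(env: TransportChallengeEnv,
                             targets: List[int],
                             container: int,
                             goal: int) -> None:
    """Naive high-level plan: fetch a container, fill it, deliver it to the goal."""
    env.step("move_to", container)
    env.step("pick_up", container)
    for obj in targets:
        env.step("move_to", obj)
        env.step("pick_up", obj)
        env.step("put_in", container)
    env.step("move_to", goal)
    env.step("drop_all")


if __name__ == "__main__":
    env = TransportChallengeEnv()
    transport_with_container(env, targets=[3, 7, 11], container=20, goal=42)
    print(env.log)
```

Using a container lets the agent move more objects per trip than the two it can hold in its arms, which is the efficiency gain the scattered containers are intended to provide.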
