DASH: Modularized Human Manipulation Simulation with Vision and Language for Embodied AI

Virtual humans with embodied, human-like perceptual and actuation constraints promise to provide an integrated simulation platform for many scientific and engineering applications. We present Dynamic and Autonomous Simulated Human (DASH), an embodied virtual human that, given natural language commands, performs grasp-and-stack tasks in a physically simulated, cluttered environment using only its own visual perception, proprioception, and touch, without requiring human motion data. By factoring the DASH system into a vision module, a language module, and manipulation modules of two skill categories, we can mix and match analytical and machine learning techniques across modules, so that DASH not only performs randomly arranged tasks with a high success rate, but also does so under anthropomorphic constraints and with fluid, diverse motions. The modular design also facilitates analysis and extension to more complex manipulation skills.
