Making Sense of Vision and Touch: Self-Supervised Learning of Multimodal Representations for Contact-Rich Tasks

Contact-rich manipulation tasks in unstructured environments often require both haptic and visual feedback. However, it is non-trivial to manually design a robot controller that combines modalities with very different characteristics. While deep reinforcement learning has shown success in learning control policies for high-dimensional inputs, these algorithms are generally intractable to deploy on real robots due to their sample complexity. We use self-supervision to learn a compact, multimodal representation of our sensory inputs, which can then be used to improve the sample efficiency of our policy learning. We evaluate our method on a peg insertion task, generalizing over different geometries, configurations, and clearances, while being robust to external perturbations. We present results in simulation and on a real robot.
