Detect, Reject, Correct: Crossmodal Compensation of Corrupted Sensors

Using sensor data from multiple modalities presents an opportunity to encode redundant and complementary features that can be useful when one modality is corrupted or noisy. Humans do this every day, relying on touch and proprioceptive feedback in visually challenging environments. However, robots might not always know when their sensors are corrupted, as even broken sensors can return seemingly valid values. In this work, we introduce the Crossmodal Compensation Model (CCM), which can detect corrupted sensor modalities and compensate for them. CCM is a representation model learned with self-supervision that leverages unimodal reconstruction losses for corruption detection. CCM then discards the corrupted modality and compensates for it with information from the remaining sensors. We show that CCM learns rich state representations that can be used for contact-rich manipulation policies, even when input modalities are corrupted in ways not seen at training time.
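
Below is a minimal sketch of the detect-reject-correct control flow described above, assuming per-modality autoencoders, a mean-squared reconstruction-error threshold for detection, and a simple concatenation-based fusion network. The class names, dimensions, modalities, and thresholds are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn


class UnimodalAutoencoder(nn.Module):
    """Encoder/decoder for one modality; reconstruction error serves as a corruption score."""

    def __init__(self, input_dim: int, latent_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim)
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, input_dim)
        )

    def forward(self, x):
        z = self.encoder(x)
        return z, self.decoder(z)


class CrossmodalCompensation(nn.Module):
    """Illustrative detect-reject-correct pipeline:
    1. Detect: flag a modality whose reconstruction error exceeds a threshold.
    2. Reject: mask out the latent code of any flagged modality.
    3. Correct: fuse the remaining latents into a joint state representation.
    """

    def __init__(self, modality_dims: dict, latent_dim: int = 32, state_dim: int = 64):
        super().__init__()
        self.autoencoders = nn.ModuleDict(
            {name: UnimodalAutoencoder(dim, latent_dim) for name, dim in modality_dims.items()}
        )
        self.fusion = nn.Sequential(
            nn.Linear(latent_dim * len(modality_dims), 128),
            nn.ReLU(),
            nn.Linear(128, state_dim),
        )
        # Hypothetical thresholds; in practice these would be calibrated on
        # held-out, uncorrupted training data.
        self.thresholds = {name: 0.1 for name in modality_dims}

    def forward(self, inputs: dict):
        latents, corrupted = [], {}
        for name, autoencoder in self.autoencoders.items():
            x = inputs[name]
            z, recon = autoencoder(x)
            error = torch.mean((recon - x) ** 2, dim=-1, keepdim=True)  # (batch, 1)
            flag = (error > self.thresholds[name]).float()  # 1.0 if flagged as corrupted
            corrupted[name] = flag
            latents.append(z * (1.0 - flag))  # reject: zero out corrupted latents
        state = self.fusion(torch.cat(latents, dim=-1))  # correct: fuse what remains
        return state, corrupted


if __name__ == "__main__":
    # Hypothetical modality dimensions for a vision / force-torque / proprioception setup.
    model = CrossmodalCompensation({"vision": 256, "force_torque": 6, "proprioception": 7})
    batch = {
        "vision": torch.randn(4, 256),
        "force_torque": torch.randn(4, 6),
        "proprioception": torch.randn(4, 7),
    }
    state, flags = model(batch)
    print(state.shape)  # torch.Size([4, 64])
```

This sketch only captures the control flow the abstract describes; how CCM learns its multimodal representation, trains the reconstruction objectives, and calibrates detection is not specified here.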
