Connecting Touch and Vision via Cross-Modal Prediction

Humans perceive the world using multi-modal sensory inputs such as vision, audition, and touch. In this work, we investigate the cross-modal connection between vision and touch. The main challenge in this cross-domain modeling task lies in the significant scale discrepancy between the two: while our eyes perceive an entire visual scene at once, humans can only feel a small region of an object at any given moment. To connect vision and touch, we introduce two new tasks: synthesizing plausible tactile signals from visual inputs, and imagining how we interact with objects given tactile data as input. To accomplish our goals, we first equip robots with both visual and tactile sensors and collect a large-scale dataset of corresponding vision and tactile image sequences. To close the scale gap, we present a new conditional adversarial model that incorporates the scale and location information of the touch. Human perceptual studies demonstrate that our model can produce realistic visual images from tactile data and vice versa. Finally, we present qualitative and quantitative results for different system designs, and visualize the learned representations of our model.
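To make the conditioning idea concrete, below is a minimal sketch of a generator that takes a visual frame together with the scale and location of a touch, encoded here as a single-channel spatial mask concatenated to the image. This is an illustrative approximation under assumptions, not the authors' released architecture: the class name `TouchConditionedGenerator`, the mask encoding, and the layer sizes are all hypothetical choices made for the example.

```python
# Sketch (PyTorch assumed): a conditional generator that maps a visual frame
# plus a touch scale/location mask to a predicted tactile image.
# All architectural details here are illustrative assumptions.
import torch
import torch.nn as nn

class TouchConditionedGenerator(nn.Module):
    def __init__(self, in_channels=4, out_channels=3, base=64):
        # in_channels = 3 RGB channels + 1 channel encoding the touch
        # location and scale as a soft spatial mask (hypothetical encoding).
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, base, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(base, base * 2, 4, stride=2, padding=1),
            nn.BatchNorm2d(base * 2),
            nn.LeakyReLU(0.2, inplace=True),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(base * 2, base, 4, stride=2, padding=1),
            nn.BatchNorm2d(base),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(base, out_channels, 4, stride=2, padding=1),
            nn.Tanh(),
        )

    def forward(self, rgb, touch_mask):
        # Concatenate the touch mask with the image so the generator knows
        # where the contact happens and how large the touched region is.
        x = torch.cat([rgb, touch_mask], dim=1)
        return self.decoder(self.encoder(x))

# Usage: a 256x256 visual frame and a matching single-channel contact mask.
rgb = torch.randn(1, 3, 256, 256)
touch_mask = torch.zeros(1, 1, 256, 256)
touch_mask[:, :, 100:140, 100:140] = 1.0  # contact region (location + scale)
fake_touch = TouchConditionedGenerator()(rgb, touch_mask)
print(fake_touch.shape)  # torch.Size([1, 3, 256, 256])
```

In a full adversarial setup, this generator would be paired with a discriminator and trained with a conditional GAN objective; the sketch only illustrates how scale and location information can be injected as an extra input channel.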
