On Robustness of Multi-Modal Fusion—Robotics Perspective

The efficient multi-modal fusion of data streams from different sensors is a crucial ability that a robotic perception system should exhibit to ensure robustness against disturbances. However, as the volume and dimensionality of sensory feedback increase, it becomes difficult to manually design a multi-modal data fusion system that can handle heterogeneous data. Multi-modal machine learning is an emerging field, with research focused mainly on analyzing vision and audio information; from the robotics perspective, however, haptic sensations experienced during interaction with the environment are essential for the successful execution of useful tasks. In our work, we compared four learning-based multi-modal fusion methods on three publicly available datasets containing haptic signals, images, and robot poses. We considered three tasks involving such data: grasp outcome classification, texture recognition, and, most challenging, multi-label classification of haptic adjectives based on haptic and visual data. The experiments focused not only on verifying the performance of each method but mainly on assessing their robustness against data degradation. We emphasized this aspect of multi-modal fusion because it is rarely considered in the literature, yet such degradation of sensory feedback may occur during a robot's interaction with its environment. Additionally, we verified the usefulness of data augmentation for increasing the robustness of the aforementioned data fusion methods.
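
The abstract does not spell out the fusion architectures or the degradation protocol, so the snippet below is only a minimal, hedged sketch: a concatenation-based fusion baseline for visual and haptic features, together with a toy degradation routine (additive Gaussian noise plus random zeroing of a modality) of the kind that could be used both to probe robustness and as a training-time augmentation. All module names, feature dimensions, and the PyTorch framing are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (assumed, not the paper's exact method): concatenation-based
# fusion of visual and haptic features, plus a simple degradation/augmentation
# routine. Dimensions and names are illustrative.
import torch
import torch.nn as nn


class ConcatFusionClassifier(nn.Module):
    """Encode each modality separately, concatenate embeddings, classify."""

    def __init__(self, visual_dim=512, haptic_dim=96, embed_dim=128, n_classes=2):
        super().__init__()
        self.visual_enc = nn.Sequential(nn.Linear(visual_dim, embed_dim), nn.ReLU())
        self.haptic_enc = nn.Sequential(nn.Linear(haptic_dim, embed_dim), nn.ReLU())
        self.head = nn.Linear(2 * embed_dim, n_classes)

    def forward(self, visual, haptic):
        fused = torch.cat([self.visual_enc(visual), self.haptic_enc(haptic)], dim=-1)
        return self.head(fused)


def degrade(x, noise_std=0.1, drop_prob=0.2):
    """Simulate sensor degradation: Gaussian noise plus random zeroing of whole
    samples (a crude stand-in for a failed sensor stream). Applying this during
    training doubles as a simple augmentation."""
    x = x + noise_std * torch.randn_like(x)
    drop = (torch.rand(x.shape[0], 1) < drop_prob).float()
    return x * (1.0 - drop)


if __name__ == "__main__":
    model = ConcatFusionClassifier()
    visual = torch.randn(8, 512)   # e.g. precomputed image features
    haptic = torch.randn(8, 96)    # e.g. flattened tactile/force readings
    logits = model(visual, degrade(haptic))  # degrade one modality at test time
    print(logits.shape)  # torch.Size([8, 2])
```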
