Memory based fusion for multi-modal deep learning

The use of multi-modal data in deep learning has shown promise over uni-modal approaches, with the fusion of multi-modal features improving performance in several applications. However, most state-of-the-art methods use naive fusion, which processes the feature streams independently and ignores possible long-term dependencies within the data during fusion. In this paper, we present a novel Memory Based Attentive Fusion (MBAF) layer, which fuses modes by incorporating both the current features and long-term dependencies in the data, allowing the model to understand the relative importance of the modes over time. We introduce an explicit memory block within the fusion layer that stores features containing long-term dependencies of the fused data. The feature inputs from the uni-modal encoders are fused through attentive composition and transformation, followed by naive fusion of the resultant memory-derived features with the layer inputs. Following state-of-the-art methods, we evaluate the performance and generalisability of the proposed fusion approach on two datasets with different modalities. In our experiments, we replace the naive fusion layer in benchmark networks with the proposed MBAF layer to enable a fair comparison. Experimental results indicate that the MBAF layer can generalise across different modalities and networks to enhance fusion and improve performance.
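To make the described architecture concrete, below is a minimal PyTorch sketch of a memory-based attentive fusion layer of this kind. The memory size, attention form, and read/transform steps are illustrative assumptions rather than the paper's exact formulation; in particular, the memory here is a learned parameter, whereas an explicit memory block that stores fused features over time would additionally need a write/update rule.

```python
# Hedged sketch of a memory-based attentive fusion layer (not the authors'
# reference implementation). All module names and hyperparameters below
# (MemoryAttentiveFusion, mem_slots, etc.) are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryAttentiveFusion(nn.Module):
    def __init__(self, feat_dim: int, mem_slots: int = 32):
        super().__init__()
        # Explicit memory block: mem_slots vectors of size feat_dim, intended
        # to capture long-term dependencies of the fused data. Here it is a
        # learned parameter; a faithful version would also write to it.
        self.memory = nn.Parameter(torch.randn(mem_slots, feat_dim) * 0.01)
        self.query = nn.Linear(2 * feat_dim, feat_dim)   # maps fused inputs to a memory query
        self.transform = nn.Linear(feat_dim, feat_dim)   # transforms the memory read-out

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor) -> torch.Tensor:
        # x_a, x_b: (batch, feat_dim) features from two uni-modal encoders.
        fused_in = torch.cat([x_a, x_b], dim=-1)          # naive concatenation of the modes
        q = self.query(fused_in)                          # (batch, feat_dim) query
        # Attentive composition: soft attention over the memory slots.
        attn = F.softmax(q @ self.memory.t(), dim=-1)     # (batch, mem_slots)
        read = attn @ self.memory                         # (batch, feat_dim) memory read-out
        mem_feat = torch.tanh(self.transform(read))       # transformed, memory-derived features
        # Naive fusion of the memory-derived features with the layer inputs.
        return torch.cat([fused_in, mem_feat], dim=-1)    # (batch, 3 * feat_dim)

# Example: fusing 128-d audio and text encoder outputs for a batch of 4.
layer = MemoryAttentiveFusion(feat_dim=128)
fused = layer(torch.randn(4, 128), torch.randn(4, 128))  # -> shape (4, 384)
```

Used as a drop-in replacement for a concatenation-based fusion layer in a benchmark network, a module like this widens the fused feature dimension from 2 x feat_dim to 3 x feat_dim, so the subsequent layer's input size must be adjusted accordingly.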
