Memory Based Attentive Fusion

The use of multi-modal data in deep machine learning has shown promise compared to uni-modal approaches, with the fusion of multi-modal features yielding improved performance. However, most state-of-the-art methods use naive fusion, which processes the feature streams of a given time-step and ignores long-term dependencies within the data during fusion. In this paper, we present a novel Memory Based Attentive Fusion (MBAF) layer, which fuses modes by incorporating both the current features and long-term dependencies in the data, thus allowing the model to understand the relative importance of modes over time. We define an explicit memory block within the fusion layer which stores features containing long-term dependencies of the fused data. The inputs to our layer are fused through attentive composition and transformation, and the transformed features are combined with the input to generate the fused layer output. Following existing state-of-the-art methods, we evaluate the performance and generalisability of the proposed approach on the IEMOCAP and PhysioNet-CMEBS datasets with different modalities. In our experiments, we replace the naive fusion layer in benchmark networks with our proposed layer to enable a fair comparison. Experimental results indicate that the MBAF layer can generalise across different modalities and networks to enhance fusion and improve performance.
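
To make the mechanism described above concrete, the sketch below shows one way such a layer could be organised: an explicit memory buffer, an attentive read over that memory conditioned on the current fused features, and a combination of the read-out with the input to form the fused output. This is a minimal illustration written in PyTorch; the module names, dimensions, the averaging used for the current-step fusion, and the memory write rule are all assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MemoryBasedAttentiveFusion(nn.Module):
    """Illustrative sketch of a memory-based attentive fusion layer.

    An explicit memory buffer stores previously fused features; the current
    multi-modal input attends over that memory, and the attention read-out
    is combined with the input to produce the fused output.
    """

    def __init__(self, feat_dim: int, memory_slots: int = 32):
        super().__init__()
        self.feat_dim = feat_dim
        # Explicit memory block holding features with long-term dependencies
        # (hypothetical fixed-slot layout; the paper's memory may differ).
        self.register_buffer("memory", torch.zeros(memory_slots, feat_dim))
        self.query_proj = nn.Linear(feat_dim, feat_dim)
        self.out_proj = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, modal_feats):
        # modal_feats: list of per-modality tensors, each (batch, feat_dim).
        # Fuse the current time-step; averaging stands in for whatever
        # attentive composition the actual layer uses.
        fused_now = torch.stack(modal_feats, dim=0).mean(dim=0)

        # Attentive read over the memory (scaled dot-product attention).
        query = self.query_proj(fused_now)                       # (batch, feat_dim)
        scores = query @ self.memory.t() / self.feat_dim ** 0.5  # (batch, slots)
        attn = F.softmax(scores, dim=-1)
        read = attn @ self.memory                                # (batch, feat_dim)

        # Combine the memory-conditioned features with the current input.
        out = self.out_proj(torch.cat([fused_now, read], dim=-1))

        # Simple write rule (assumed): overwrite the least-attended slot
        # with the batch-averaged current fused feature.
        with torch.no_grad():
            slot = attn.mean(dim=0).argmin()
            self.memory[slot] = fused_now.mean(dim=0)
        return out


# Example usage with two hypothetical modality streams:
layer = MemoryBasedAttentiveFusion(feat_dim=128)
audio_feat = torch.randn(4, 128)
text_feat = torch.randn(4, 128)
fused = layer([audio_feat, text_feat])  # (4, 128)
```

In this sketch the memory persists across calls, so the fused output at a given step can draw on features from earlier steps rather than the current time-step alone.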
