AWDF: An Adaptive Weighted Deep Fusion Architecture for Multi-modality Learning

Fusion has been widely used in the machine learning community, especially for problems that involve multiple input sources and classifiers. The common strategy for information fusion in deep neural networks is to directly concatenate the embedded features of the input sources in a latent space. However, this makes it hard to capture the relative importance of the fused sources, and it cannot model correlations among the fused multimodal inputs, e.g., intra-class and inter-class similarities. In addition, most existing deep fusion approaches use a universal fusion-weight strategy, which cannot fully exploit the relative importance of different inputs. To address these problems, we propose an Adaptive Weighted Deep Fusion (AWDF) scheme that captures potential relationships among input sources and integrates feature-level and decision-level fusion in a single framework. Furthermore, to overcome the limitations of existing fusion models with fixed weights, we propose a Cross Decision Weights Method (CDWM), which dynamically learns a weight for each input branch during fusion instead of relying on pre-defined weights. To evaluate AWDF, we conduct experiments on three real-world datasets: the Wild Business Terms (WBT) dataset, the Iceberg Detection dataset, and the CareerCon dataset. Our experimental results demonstrate the superiority of AWDF over other fusion approaches.
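To make the adaptive-weighting idea concrete, below is a minimal PyTorch sketch of the general scheme: each input branch produces an embedding, a small gating network predicts a softmax-normalized weight per branch, and the fused representation is the weighted sum of the branch embeddings. All layer names and dimensions here are hypothetical, and the sketch illustrates only the learned per-branch weighting (the feature-level component), not the paper's exact CDWM or its full feature-plus-decision-level framework.

```python
import torch
import torch.nn as nn

class AdaptiveWeightedFusion(nn.Module):
    """Illustrative sketch of adaptive weighted fusion (not the authors'
    exact CDWM): a gating network scores each branch, and the fused
    feature is the weighted sum of branch embeddings."""

    def __init__(self, embed_dim: int, num_branches: int, num_classes: int):
        super().__init__()
        # One scalar score per branch, computed from all branch embeddings,
        # so the weights can adapt to each individual input example.
        self.gate = nn.Linear(embed_dim * num_branches, num_branches)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, branch_embeddings):
        # branch_embeddings: list of tensors, each of shape (batch, embed_dim)
        stacked = torch.stack(branch_embeddings, dim=1)        # (batch, B, D)
        flat = stacked.flatten(start_dim=1)                    # (batch, B*D)
        weights = torch.softmax(self.gate(flat), dim=1)        # (batch, B)
        fused = (weights.unsqueeze(-1) * stacked).sum(dim=1)   # (batch, D)
        return self.classifier(fused), weights

# Example usage: fuse a text branch and an image branch (hypothetical dims).
model = AdaptiveWeightedFusion(embed_dim=128, num_branches=2, num_classes=3)
text_emb = torch.randn(4, 128)
image_emb = torch.randn(4, 128)
logits, branch_weights = model([text_emb, image_emb])
```

The key contrast with fixed-weight fusion is that `branch_weights` is computed per example at inference time, so an input whose image modality is uninformative can lean on the text branch, and vice versa.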
