CentralNet: a Multilayer Approach for Multimodal Fusion

This paper proposes a novel multimodal fusion approach, aiming to produce the best possible decisions by integrating information coming from multiple media. While most past multimodal approaches either project the features of the different modalities into a common space, or coordinate the representations of each modality through the use of constraints, our approach borrows from both views. More specifically, assuming each modality can be processed by a separate deep convolutional network, allowing decisions to be made independently for each modality, we introduce a central network linking the modality-specific networks. This central network not only provides a common feature embedding but also regularizes the modality-specific networks through multi-task learning. The proposed approach is validated on four different computer vision tasks, on which it consistently improves over the accuracy of existing multimodal fusion approaches.
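To make the architecture described above concrete, the following is a minimal sketch of a CentralNet-style fusion block in PyTorch. It assumes a weighted-sum reading of the layer-wise fusion (learnable scalar weights combining the central hidden state with each modality's hidden state) and a multi-task objective summing the central and per-modality losses; the class and parameter names (`CentralFusionLayer`, `alphas`) are hypothetical, not the authors' exact implementation.

```python
# Hypothetical sketch of a CentralNet-style fusion block.
# The weighted-sum fusion rule and the summed multi-task loss are
# assumptions drawn from the abstract, not the paper's exact code.
import torch
import torch.nn as nn

class CentralFusionLayer(nn.Module):
    """Fuses the hidden states of M modality networks with the central state."""
    def __init__(self, num_modalities: int, dim: int):
        super().__init__()
        # One learnable scalar per input stream (central + each modality).
        self.alphas = nn.Parameter(torch.ones(num_modalities + 1))
        self.layer = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())

    def forward(self, central, modality_hiddens):
        # Weighted sum of the central state and the modality hidden states,
        # followed by the central network's own layer.
        fused = self.alphas[0] * central
        for a, h in zip(self.alphas[1:], modality_hiddens):
            fused = fused + a * h
        return self.layer(fused)

def total_loss(central_logits, modality_logits, target,
               criterion=nn.CrossEntropyLoss()):
    # Multi-task objective: the central loss plus each modality's own loss.
    # Minimizing the sum is what regularizes the modality-specific branches.
    loss = criterion(central_logits, target)
    for logits in modality_logits:
        loss = loss + criterion(logits, target)
    return loss
```

Under this reading, the learnable scalars let the network decide, at every depth, how much each modality (and the accumulated central representation) should contribute, while the per-modality losses keep each branch independently predictive.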
