Multilevel Sensor Fusion With Deep Learning

In the context of deep learning, this article presents an original deep network, called CentralNet, for fusing information coming from different sensors. The approach is designed to automatically and efficiently balance the tradeoff between early and late fusion (i.e., between fusing low-level and high-level information). More specifically, at each level of abstraction (i.e., at each layer of the deep networks), unimodal representations of the data are fed to a central neural network that combines them into a common embedding. In addition, a multiobjective regularization is introduced to jointly optimize the central network and the unimodal networks. Experiments on four multimodal datasets not only show state-of-the-art performance but also demonstrate that CentralNet can effectively select the fusion strategy best suited to a given problem.
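
To make the layer-wise fusion idea concrete, the following is a minimal PyTorch sketch, not the authors' implementation: each fusion step takes the previous central representation and the current unimodal hidden representations and combines them through trainable scalar weights before a learned transformation. The names CentralFusionBlock, alpha_central, and alpha_modal are hypothetical, and the multiobjective loss is only indicated schematically in the trailing comments.

import torch
import torch.nn as nn

class CentralFusionBlock(nn.Module):
    """One fusion step (sketch): weighted sum of the previous central
    representation and the current unimodal representations, followed by
    a learned transformation."""
    def __init__(self, dim, num_modalities):
        super().__init__()
        # Trainable scalar weights: one for the central stream, one per modality.
        self.alpha_central = nn.Parameter(torch.ones(1))
        self.alpha_modal = nn.Parameter(torch.ones(num_modalities))
        self.transform = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())

    def forward(self, h_central, h_modalities):
        # h_central: (batch, dim); h_modalities: list of (batch, dim) tensors.
        fused = self.alpha_central * h_central
        for alpha, h in zip(self.alpha_modal, h_modalities):
            fused = fused + alpha * h
        return self.transform(fused)

# Multiobjective regularization (schematic): the central prediction and each
# unimodal prediction all receive a supervised loss, so the unimodal branches
# remain informative while the central network learns how to fuse them.
# criterion = nn.CrossEntropyLoss()
# loss = criterion(central_logits, y) + sum(criterion(m, y) for m in modal_logits)

Because the fusion weights are learned per layer, a configuration that favors early layers approximates early fusion while one that favors the last layers approximates late fusion, which is how the network can settle on the strategy best suited to the data.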
