MFAS: Multimodal Fusion Architecture Search

We tackle the problem of finding good architectures for multimodal classification. We propose a novel and generic search space that spans a large number of possible fusion architectures. To find an optimal architecture for a given dataset within this search space, we leverage an efficient sequential model-based exploration approach tailored to the problem. We demonstrate the value of posing multimodal fusion as a neural architecture search problem through extensive experiments on a toy dataset and two real multimodal datasets. We discover fusion architectures that exhibit state-of-the-art performance across problems of different domains and dataset sizes, including the NTU RGB+D dataset, the largest multimodal action recognition dataset available.
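To make the idea of sequential model-based exploration over a fusion search space concrete, the following is a minimal, hypothetical sketch. The search space shown here (which hidden layer of each unimodal network to fuse and which non-linearity to apply at the fusion point), the placeholder evaluate() function, and the toy surrogate_score() heuristic are all illustrative assumptions, not the actual MFAS implementation or its learned surrogate.

```python
# Illustrative sketch (not the authors' code) of sequential model-based search
# over a discrete multimodal fusion search space. All names and the toy
# surrogate/evaluation functions are assumptions made for illustration.
import itertools
import random

# Hypothetical search space: which hidden layer of each unimodal network to
# fuse, and which non-linearity to apply at the fusion layer.
LAYERS_MOD_A = range(4)            # candidate feature layers from modality A
LAYERS_MOD_B = range(4)            # candidate feature layers from modality B
ACTIVATIONS = ["relu", "sigmoid", "tanh"]

SPACE = list(itertools.product(LAYERS_MOD_A, LAYERS_MOD_B, ACTIVATIONS))


def evaluate(config):
    """Placeholder for briefly training the sampled fusion network and
    returning its validation accuracy; replaced here by a seeded random
    score so the sketch runs stand-alone."""
    rng = random.Random(hash(config))
    return rng.random()


def surrogate_score(config, history):
    """Toy surrogate: average accuracy of already-evaluated configurations
    that share at least one choice with `config` (a stand-in for a learned
    performance predictor)."""
    similar = [acc for c, acc in history
               if sum(a == b for a, b in zip(c, config)) >= 1]
    return sum(similar) / len(similar) if similar else 0.5


def smbo_search(budget=20, candidates_per_step=5):
    """Sequential model-based loop: warm-start with random configurations,
    then repeatedly rank unseen configurations with the surrogate and
    evaluate the most promising ones."""
    history = []
    for config in random.sample(SPACE, candidates_per_step):
        history.append((config, evaluate(config)))
    while len(history) < budget:
        seen = dict(history)
        unseen = [c for c in SPACE if c not in seen]
        ranked = sorted(unseen, key=lambda c: surrogate_score(c, history),
                        reverse=True)
        for config in ranked[:candidates_per_step]:
            history.append((config, evaluate(config)))
    return max(history, key=lambda item: item[1])


if __name__ == "__main__":
    best_config, best_acc = smbo_search()
    print("best fusion configuration:", best_config, "score:", best_acc)
```

In the actual approach one would expect evaluate() to correspond to briefly training the sampled fusion architecture on the target dataset, and the surrogate to be a learned performance predictor that is refit after every round of evaluations.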
