Towards Good Practices for Multi-modal Fusion in Large-scale Video Classification

Leveraging both visual frames and audio has been experimentally proven effective to improve large-scale video classification. Previous research on video classification mainly focuses on the analysis of visual content among extracted video frames and their temporal feature aggregation. In contrast, multimodal data fusion is achieved by simple operators like average and concatenation. Inspired by the success of bilinear pooling in the visual and language fusion, we introduce multi-modal factorized bilinear pooling (MFB) to fuse visual and audio representations. We combine MFB with different video-level features and explore its effectiveness in video classification. Experimental results on the challenging Youtube-8M v2 dataset demonstrate that MFB significantly outperforms simple fusion methods in large-scale video classification.

[1]  Yoshua Bengio,et al.  On the Properties of Neural Machine Translation: Encoder–Decoder Approaches , 2014, SSST@EMNLP.

[2]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Josef Sivic,et al.  NetVLAD: CNN Architecture for Weakly Supervised Place Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[5]  Charless C. Fowlkes,et al.  Bilinear classifiers for visual recognition , 2009, NIPS.

[6]  Cordelia Schmid,et al.  Evaluation of Local Spatio-temporal Features for Action Recognition , 2009, BMVC.

[7]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[8]  Yang Gao,et al.  Compact Bilinear Pooling , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Zhou Yu,et al.  Multi-modal Factorized Bilinear Pooling with Co-attention Learning for Visual Question Answering , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[10]  Abhinav Gupta,et al.  ActionVLAD: Learning Spatio-Temporal Aggregation for Action Classification , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[12]  Aren Jansen,et al.  CNN architectures for large-scale audio classification , 2016, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[13]  Apostol Natsev,et al.  YouTube-8M: A Large-Scale Video Classification Benchmark , 2016, ArXiv.

[14]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[15]  Cordelia Schmid,et al.  Aggregating local descriptors into a compact image representation , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[16]  Bernard Ghanem,et al.  ActivityNet: A large-scale video benchmark for human activity understanding , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Matthieu Cord,et al.  MUTAN: Multimodal Tucker Fusion for Visual Question Answering , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[18]  Matthew J. Hausknecht,et al.  Beyond short snippets: Deep networks for video classification , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Tao Mei,et al.  Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[20]  Tara N. Sainath,et al.  Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups , 2012, IEEE Signal Processing Magazine.

[21]  Florent Perronnin,et al.  Fisher Kernels on Visual Vocabularies for Image Categorization , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[22]  Andrew Zisserman,et al.  Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Robert A. Jacobs,et al.  Hierarchical Mixtures of Experts and the EM Algorithm , 1993, Neural Computation.

[24]  Sergey Ioffe,et al.  Rethinking the Inception Architecture for Computer Vision , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Luc Van Gool,et al.  Temporal Segment Networks: Towards Good Practices for Deep Action Recognition , 2016, ECCV.

[26]  Ivan Laptev,et al.  Learnable pooling with Context Gating for video classification , 2017, ArXiv.

[27]  Yann LeCun,et al.  A Closer Look at Spatiotemporal Convolutions for Action Recognition , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[28]  Fei-Fei Li,et al.  ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.