Multi-modal Aggregation for Video Classification

In this paper, we present the solution that ranked first place in the Large-Scale Video Classification Challenge (LSVC2017) [1]. We focused on a variety of modalities covering visual, motion, and audio information, and we visualized the aggregation process to better understand how each modality takes effect. Among the extracted modalities, we found the temporal-spatial features computed by 3D convolution particularly promising, as they greatly improved performance. With the ensemble model, we attained a mean average precision (mAP) of 0.8741, the official metric, on the test set.
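To make the aggregation step concrete, below is a minimal PyTorch sketch of late fusion over per-modality features, together with a 3D convolution of the kind used to extract temporal-spatial features from a clip. The module name MultiModalAggregator, the feature dimensions, and the class count are illustrative assumptions, not the authors' actual architecture.

    import torch
    import torch.nn as nn

    # Illustrative late-fusion aggregator: concatenate per-modality,
    # video-level features and classify. Dimensions and the number of
    # classes are assumptions, not the paper's configuration.
    class MultiModalAggregator(nn.Module):
        def __init__(self, visual_dim=2048, motion_dim=1024, audio_dim=128,
                     num_classes=500):
            super().__init__()
            fused_dim = visual_dim + motion_dim + audio_dim
            self.classifier = nn.Sequential(
                nn.Linear(fused_dim, 2048),
                nn.ReLU(inplace=True),
                nn.Linear(2048, num_classes),
            )

        def forward(self, visual, motion, audio):
            # Each input: (batch, feature_dim), e.g. pooled over the video.
            fused = torch.cat([visual, motion, audio], dim=1)
            return self.classifier(fused)

    # Temporal-spatial features come from convolving over time as well as
    # space: the input is a clip of shape (batch, channels, time, H, W).
    conv3d = nn.Conv3d(in_channels=3, out_channels=64,
                       kernel_size=(3, 7, 7), stride=(1, 2, 2),
                       padding=(1, 3, 3))
    clip = torch.randn(2, 3, 16, 112, 112)  # two clips of 16 RGB frames
    features = conv3d(clip)                 # shape: (2, 64, 16, 56, 56)

In practice, the per-modality features would come from pretrained visual, motion, and audio networks; only the fusion and classification layers in this sketch would be trained on the target labels.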

[1] Aren Jansen, et al. Audio Set: An ontology and human-labeled dataset for audio events. ICASSP, 2017.

[2] Andrew Zisserman, et al. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. CVPR, 2017.

[3] Sergey Ioffe, et al. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. AAAI, 2016.

[4] Shih-Fu Chang, et al. Exploiting Feature and Class Relationships in Video Categorization with Regularized Deep Neural Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015.

[5] Bolei Zhou, et al. Places: A 10 Million Image Database for Scene Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

[6] Gang Sun, et al. Squeeze-and-Excitation Networks. CVPR, 2018.

[7] Apostol Natsev, et al. YouTube-8M: A Large-Scale Video Classification Benchmark. arXiv preprint, 2016.

[8] Fabio Viola, et al. The Kinetics Human Action Video Dataset. arXiv preprint, 2017.

[9] Matthieu Guillaumin, et al. Food-101 - Mining Discriminative Components with Random Forests. ECCV, 2014.

[10] Ivan Laptev, et al. Learnable pooling with Context Gating for video classification. arXiv preprint, 2017.