Noise Learning for Weakly Supervised Segment Classification in Video

This paper describes our solution for the 3rd YouTube-8M video understanding challenge. Unlike previous editions, this year's task is to recognize segments within videos, given a large-scale video dataset with video-level labels and a small-scale video dataset with segment-level labels. The task can therefore be regarded as a weakly supervised learning problem. Our solution consists of three models: a segment-level classifier, a self-attention mechanism, and a noise learning classifier. Among them, the noise learning classifier performs best: noise learning reduces the label and sample noise in the training data and thereby improves performance. An ensemble of the introduced models achieves a MAP of 0.78878 on the private leaderboard, ranking 8th in the challenge.
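The abstract does not say how the ensemble combines its member models. As a minimal sketch, assuming the common choice of a weighted average of per-class scores, the combination for one segment could look like the following. The function name `ensemble_scores`, the weights, and the example scores are all hypothetical and are not taken from the paper.

```python
from typing import List, Optional, Sequence


def ensemble_scores(
    model_scores: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> List[float]:
    """Weighted average of per-class scores from several models for one segment.

    Assumption: each inner sequence holds one model's scores for the same
    classes in the same order. This is an illustrative scheme, not the
    authors' confirmed method.
    """
    if not model_scores:
        raise ValueError("need at least one model")
    n_models = len(model_scores)
    n_classes = len(model_scores[0])
    if any(len(scores) != n_classes for scores in model_scores):
        raise ValueError("all models must score the same number of classes")

    # Default to a plain (unweighted) mean when no weights are given.
    if weights is None:
        weights = [1.0] * n_models
    if len(weights) != n_models:
        raise ValueError("need exactly one weight per model")
    total = float(sum(weights))
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    # Normalise by the weight total so the output stays on the input scale.
    return [
        sum(w * scores[c] for w, scores in zip(weights, model_scores)) / total
        for c in range(n_classes)
    ]


if __name__ == "__main__":
    # Hypothetical scores from three models for two classes.
    scores = [[0.2, 0.8], [0.4, 0.6], [0.6, 0.4]]
    print(ensemble_scores(scores))
    print(ensemble_scores(scores, weights=[1.0, 1.0, 2.0]))
```

In a sketch like this, the weights would typically be tuned on held-out segment-level labels, for example by giving more weight to the best-performing member.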
