Hierarchical Sequence Representation with Graph Network

Video classification problem is a challenging task in computer vision. The performance of this task is highly relied on the scale of training data and the effectiveness of video embedding via a robust embedding network. Unsupervised solutions such as feature average pooling technique, as a simple label-independent and parameter-free based method, cannot efficiently represent the video sequences. While supervised methods, such as RNN, can improve the recognition accuracy. The performance of RNN based methods, however, is decreased with the increasing length of the videos and the hierarchical relationships between frames across events in the video. In this paper, we propose a novel video classification method based on a deep convolutional graph neural network (DCGN). The proposed method utilizes the characteristics of the hierarchical structure of the video, and performed multi-level embedding feature extraction on the video frame sequence through the graph network, and obtained a video representation which reflects the event semantics hierarchically. Experiments on YouTube-8M Large-Scale Video Understanding dataset show that our proposed model outperforms the commonly used RNN based models, verifying its effectiveness for video classification.

[1]  Max Welling,et al.  Semi-Supervised Classification with Graph Convolutional Networks , 2016, ICLR.

[2]  C. Schmid,et al.  Category-Specific Video Summarization , 2014, ECCV.

[3]  Apostol Natsev,et al.  YouTube-8M: A Large-Scale Video Classification Benchmark , 2016, ArXiv.

[4]  Matthew J. Hausknecht,et al.  Beyond short snippets: Deep networks for video classification , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Sergey Ioffe,et al.  Rethinking the Inception Architecture for Computer Vision , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Shuai Li,et al.  Independently Recurrent Neural Network (IndRNN): Building A Longer and Deeper RNN , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[7]  Yoshua Bengio,et al.  Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation , 2014, EMNLP.

[8]  Robert A. Jacobs,et al.  Hierarchical Mixtures of Experts and the EM Algorithm , 1993, Neural Computation.

[9]  Christopher Joseph Pal,et al.  Delving Deeper into Convolutional Networks for Learning Video Representations , 2015, ICLR.

[10]  Xirong Li,et al.  Dual Encoding for Zero-Example Video Retrieval , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Nitish Srivastava,et al.  Unsupervised Learning of Video Representations using LSTMs , 2015, ICML.

[12]  Jianfeng Dong,et al.  Noise Learning for Weakly Supervised Segment Classification in Video , 2019 .