Improving Video Captioning with Non-Local Neural Networks

Unlike static images, a video carries not only visual features but also richer semantic relationships between objects and scenes that arise from its temporal dimension. There have been many attempts to describe spatial and temporal relationships in videos, but encoder-decoder models alone are not enough to capture such detailed relationships. In particular, a video clip often consists of several seemingly unrelated shots, and simple recurrent models suffer from these shot changes. Recently, several studies have described visual relations through relational reasoning on visual question answering and action recognition tasks. In this paper, we introduce an approach that captures temporal relationships with a non-local block and a boundary-aware encoding scheme. We evaluate our approach on the Microsoft Video Description Corpus (MSVD, also known as YouTube2Text). Experimental results show that a non-local block applied along the temporal axis improves video captioning performance on the MSVD dataset.
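The non-local operation mentioned above computes, for each time step, a weighted sum of the features at every other time step, so that frames from distant and seemingly unrelated shots can still inform one another. Below is a minimal sketch of such a block applied along the temporal axis, following the embedded-Gaussian formulation of Wang et al.'s non-local networks; it is not the authors' released code, and the module name, tensor shapes, and layer sizes are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalNonLocalBlock(nn.Module):
    # Hypothetical module; inputs follow the shape (batch, channels, time).
    def __init__(self, in_channels, inter_channels=None):
        super().__init__()
        self.inter_channels = inter_channels or in_channels // 2
        # 1x1 temporal convolutions implement the linear embeddings
        # theta, phi, g of the embedded-Gaussian non-local operation.
        self.theta = nn.Conv1d(in_channels, self.inter_channels, 1)
        self.phi = nn.Conv1d(in_channels, self.inter_channels, 1)
        self.g = nn.Conv1d(in_channels, self.inter_channels, 1)
        self.out = nn.Conv1d(self.inter_channels, in_channels, 1)

    def forward(self, x):
        # x: (B, C, T), one feature vector per sampled frame.
        theta = self.theta(x)                                   # (B, C', T)
        phi = self.phi(x)                                       # (B, C', T)
        g = self.g(x)                                           # (B, C', T)
        # attn[b, i, j]: softmax-normalized affinity of frame i to frame j,
        # computed over all pairs of time steps regardless of distance.
        attn = F.softmax(torch.bmm(theta.transpose(1, 2), phi), dim=-1)  # (B, T, T)
        y = torch.bmm(attn, g.transpose(1, 2)).transpose(1, 2)  # (B, C', T)
        # Residual connection: the block refines rather than replaces x.
        return x + self.out(y)

# Example: a batch of 4 clips, each with 26 frames of 2048-d CNN features.
feats = torch.randn(4, 2048, 26)
print(TemporalNonLocalBlock(2048)(feats).shape)  # torch.Size([4, 2048, 26])

In a pipeline of this kind, such a block would typically sit between the frame-level CNN features and the recurrent encoder; the residual connection lets it be inserted without disturbing the encoder's initial behaviour.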
