SBAT: Video Captioning with Sparse Boundary-Aware Transformer

In this paper, we focus on effectively applying the transformer architecture to video captioning. The vanilla transformer was proposed for unimodal language generation tasks such as machine translation, whereas video captioning is a multimodal learning problem whose video features contain substantial redundancy across time steps. Motivated by these observations, we propose a novel method called the sparse boundary-aware transformer (SBAT) to reduce the redundancy in video representations. SBAT applies a boundary-aware pooling operation to the multi-head attention scores and selects diverse features from different scenarios. SBAT also includes a local correlation scheme to compensate for the local information lost by the sparse operation. On top of SBAT, we further propose an aligned cross-modal encoding scheme to strengthen multimodal interaction. Experimental results on two benchmark datasets show that SBAT outperforms state-of-the-art methods on most metrics.
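To make the sparse boundary-aware idea concrete, the sketch below shows one plausible way to restrict multi-head attention to boundary-like time steps. This is a minimal PyTorch sketch, not the paper's actual implementation: the boundary signal (the absolute change of attention scores between adjacent key positions), the `top_k` budget, and the function name `sparse_boundary_aware_attention` are all illustrative assumptions.

```python
# Hypothetical sketch of a sparse boundary-aware attention step.
# For each query, only the top-k key positions whose attention scores
# change most sharply relative to their temporal neighbors are kept,
# approximating attention to scene boundaries.
import torch
import torch.nn.functional as F

def sparse_boundary_aware_attention(q, k, v, top_k=8):
    """q, k, v: (batch, heads, time, dim). Returns attended values."""
    d = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / d ** 0.5  # (B, H, Tq, Tk)

    # Boundary signal: absolute change of scores between adjacent key steps.
    # Large values suggest a transition between scenarios.
    diff = (scores[..., 1:] - scores[..., :-1]).abs()
    boundary = F.pad(diff, (1, 0))  # pad left so shape matches Tk again

    # Keep only the top-k boundary positions per query; mask out the rest.
    top_k = min(top_k, scores.size(-1))
    idx = boundary.topk(top_k, dim=-1).indices
    mask = torch.full_like(scores, float('-inf'))
    mask.scatter_(-1, idx, 0.0)

    attn = F.softmax(scores + mask, dim=-1)  # sparse attention weights
    return torch.matmul(attn, v)

# Usage: 2 videos, 4 heads, 20 frames, 64-dim features per head.
q = k = v = torch.randn(2, 4, 20, 64)
out = sparse_boundary_aware_attention(q, k, v)
print(out.shape)  # torch.Size([2, 4, 20, 64])
```

Under these assumptions, each query attends to at most `top_k` time steps, which directly reduces the temporal redundancy the abstract describes; a local correlation scheme, as in the paper, would then be needed to restore fine-grained neighborhood information that the sparsification discards.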
