Enhancing the alignment between target words and corresponding frames for video captioning

Abstract: Video captioning aims to translate a sequence of video frames into a sequence of words using the encoder-decoder framework, so it is critical to align these two different sequences. Most existing methods exploit a soft-attention (temporal attention) mechanism to align target words with corresponding frames, where their relevance depends solely on the previously generated words (i.e., the language context). However, there is an inherent gap between vision and language, and most of the words in a caption are non-visual words (e.g., “a”, “is”, and “in”). Hence, with the guidance of the language context alone, existing temporal attention-based methods cannot exactly align target words with corresponding frames. To address this problem, we first introduce pre-detected visual tags from the video to bridge the gap between vision and language: visual tags belong to the textual modality, yet they also convey visual information. We then present a Textual-Temporal Attention Model (TTA) to exactly align target words with corresponding frames. Experimental results show that our proposed method outperforms state-of-the-art methods on two well-known datasets, MSVD and MSR-VTT.
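
For readers unfamiliar with the baseline the abstract refers to, the sketch below illustrates a minimal temporal soft-attention module of the kind most existing methods use: frame-level attention weights are computed solely from the decoder's hidden state (the language context) and the frame features. This is an illustrative assumption about the standard baseline, not the authors' TTA model; all class, function, and parameter names (TemporalAttention, frame_dim, attn_dim, etc.) are hypothetical.

```python
import torch
import torch.nn as nn


class TemporalAttention(nn.Module):
    """Minimal soft (temporal) attention over frame features,
    conditioned only on the decoder hidden state (language context)."""

    def __init__(self, frame_dim: int, hidden_dim: int, attn_dim: int):
        super().__init__()
        self.frame_proj = nn.Linear(frame_dim, attn_dim)   # project frame features
        self.state_proj = nn.Linear(hidden_dim, attn_dim)  # project decoder state
        self.score = nn.Linear(attn_dim, 1)                # scalar relevance score

    def forward(self, frames: torch.Tensor, hidden: torch.Tensor):
        # frames: (batch, num_frames, frame_dim); hidden: (batch, hidden_dim)
        energy = torch.tanh(self.frame_proj(frames) + self.state_proj(hidden).unsqueeze(1))
        weights = torch.softmax(self.score(energy).squeeze(-1), dim=-1)  # (batch, num_frames)
        context = (weights.unsqueeze(-1) * frames).sum(dim=1)            # (batch, frame_dim)
        return context, weights


# Example with made-up sizes: 26 sampled frames with 2048-d CNN features, 512-d decoder state.
attn = TemporalAttention(frame_dim=2048, hidden_dim=512, attn_dim=256)
context, weights = attn(torch.randn(4, 26, 2048), torch.randn(4, 512))
```

Because the weights here depend only on the decoder state, non-visual target words give the module no reliable cue for which frames matter; the paper's contribution is to additionally ground this alignment in pre-detected visual tags.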
