STAT: Spatial-Temporal Attention Mechanism for Video Captioning

Video captioning is the task of automatically generating natural language sentences that summarize the content of a video. Inspired by the visual attention mechanism of human beings, temporal attention has been widely used in video description to selectively focus on important frames. However, most existing methods based on temporal attention suffer from recognition errors and missing details, because temporal attention alone cannot further capture significant regions within frames. To address these problems, we propose a novel spatial-temporal attention mechanism (STAT) within an encoder-decoder neural network for video captioning. The proposed STAT takes into account both the spatial and temporal structures of a video, enabling the decoder to automatically select the significant regions in the most relevant temporal segments for word prediction. We evaluate STAT on two well-known benchmarks: MSVD and MSR-VTT-10K. Experimental results show that the proposed STAT achieves state-of-the-art performance on several popular evaluation metrics: BLEU-4, METEOR, and CIDEr.
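The core idea, attending first over regions within each frame (spatial) and then over frames (temporal), can be sketched in a minimal pure-Python form. This is an illustrative assumption of the two-stage scheme, not the paper's implementation: real models score features against the decoder hidden state through learned projections rather than the raw dot products used here, and the function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def spatial_temporal_attention(video_feats, query):
    """Two-stage attention sketch (illustrative, not the paper's code).

    video_feats: list over frames; each frame is a list of region
                 feature vectors (lists of floats).
    query:       decoder hidden state, same dimension as the regions.

    Stage 1 (spatial): within each frame, weight every region by its
    relevance to the query and pool regions into one frame vector.
    Stage 2 (temporal): weight the pooled frame vectors by their
    relevance to the query and pool them into one context vector,
    which would condition the next word prediction.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    dim = len(query)

    # Spatial attention: regions -> one vector per frame.
    frame_vecs = []
    for regions in video_feats:
        w = softmax([dot(r, query) for r in regions])
        frame_vecs.append(
            [sum(wi * r[d] for wi, r in zip(w, regions)) for d in range(dim)]
        )

    # Temporal attention: frame vectors -> one context vector.
    w = softmax([dot(f, query) for f in frame_vecs])
    return [sum(wi * f[d] for wi, f in zip(w, frame_vecs)) for d in range(dim)]
```

Because the spatial pass runs before the temporal pass, a region can dominate the context vector only if its frame is also judged relevant, which is the intuition behind selecting "significant regions in the most relevant temporal segments."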
