Sketch, Ground, and Refine: Top-Down Dense Video Captioning

The dense video captioning task aims to detect and describe a sequence of events in a video for detailed and coherent storytelling. Previous works mainly adopt a "detect-then-describe" framework, which first detects event proposals in the video and then generates descriptions for the detected events. However, the definition of an event is diverse: an event can be as simple as a single action or as complex as a set of sub-events, depending on the semantic context. Directly detecting events from video information alone is therefore ill-defined and hurts the coherence and accuracy of the generated dense captions. In this work, we reverse the predominant "detect-then-describe" paradigm and propose a top-down approach that first generates a paragraph from a global view and then grounds each event description to a video segment for detailed refinement. We formulate this as a Sketch, Ground, and Refine (SGR) process. The sketch stage generates a coarse-grained multi-sentence paragraph describing the whole video, where each sentence is treated as an event and localized in the grounding stage. In the refining stage, we improve captioning quality via refinement-enhanced training and dual-path cross attention over both the coarse-grained event captions and the aligned event segments. The updated event captions can in turn further adjust their segment boundaries. Our SGR model outperforms state-of-the-art methods on the ActivityNet Captioning benchmark under both traditional and story-oriented dense caption evaluations. Code will be released at github.com/bearcatt/SGR.
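
To make the three-stage flow concrete, below is a minimal, hypothetical PyTorch sketch of the Sketch, Ground, and Refine pipeline. All module names (`SGRSketchModel`, `event_queries`, `refine_head`), dimensions, the one-state-per-sentence decoding, and the segment-masked attention are assumptions for illustration; the abstract does not specify the actual architecture, so this should be read as a schematic of the data flow rather than the paper's implementation.

```python
import torch
import torch.nn as nn


class SGRSketchModel(nn.Module):
    """Schematic top-down flow: sketch a paragraph, ground each sentence, refine it."""

    def __init__(self, d_model=512, vocab_size=10000, num_events=4, num_heads=8):
        super().__init__()
        self.num_events = num_events
        # Sketch stage: learned event queries attend to global video features;
        # each query yields a single state per sentence here for brevity.
        self.event_queries = nn.Parameter(torch.randn(num_events, d_model))
        self.sketch_decoder = nn.TransformerDecoderLayer(
            d_model, nhead=num_heads, batch_first=True
        )
        self.word_head = nn.Linear(d_model, vocab_size)
        # Ground stage: regress a normalized (center, width) segment per sentence.
        self.ground_head = nn.Linear(d_model, 2)
        # Refine stage: dual-path cross attention over the coarse caption states
        # and over the video features inside each grounded segment.
        self.caption_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.segment_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.refine_head = nn.Linear(d_model, vocab_size)

    def forward(self, video_feats):
        # video_feats: (B, T, d_model) clip-level features, e.g. from a C3D/TSN backbone.
        B, T, D = video_feats.shape
        queries = self.event_queries.unsqueeze(0).expand(B, -1, -1)      # (B, E, D)

        # --- Sketch: coarse multi-sentence paragraph (one state per sentence here).
        sent_states = self.sketch_decoder(queries, video_feats)          # (B, E, D)
        coarse_logits = self.word_head(sent_states)                      # (B, E, V)

        # --- Ground: localize each sentence to a normalized video segment.
        segments = torch.sigmoid(self.ground_head(sent_states))          # (B, E, 2)
        centers, widths = segments[..., 0], segments[..., 1]

        # Per-sentence mask of frames inside its grounded segment; the clamp keeps
        # at least one frame per segment so attention is never fully masked.
        t = torch.linspace(0, 1, T, device=video_feats.device)           # (T,)
        dist = (t.view(1, 1, T) - centers.unsqueeze(-1)).abs()           # (B, E, T)
        half = torch.clamp(widths.unsqueeze(-1) / 2, min=1.0 / T)
        inside = dist <= half                                            # (B, E, T)

        # --- Refine: caption path attends over the coarse sentences, segment path
        # attends over the grounded video frames; fuse and re-decode the words.
        cap_path, _ = self.caption_attn(sent_states, sent_states, sent_states)
        seg_path = []
        for e in range(self.num_events):
            out, _ = self.segment_attn(
                sent_states[:, e : e + 1], video_feats, video_feats,
                key_padding_mask=~inside[:, e],                          # True = ignore
            )
            seg_path.append(out)
        seg_path = torch.cat(seg_path, dim=1)                            # (B, E, D)
        refined_logits = self.refine_head(cap_path + seg_path)           # (B, E, V)
        return coarse_logits, segments, refined_logits


if __name__ == "__main__":
    model = SGRSketchModel()
    feats = torch.randn(2, 64, 512)                                      # 2 videos, 64 clips
    coarse, segs, refined = model(feats)
    print(coarse.shape, segs.shape, refined.shape)                       # sanity check
```

The segment mask that restricts the refine-stage attention to the grounded interval is one plausible reading of "dual-path cross attention on both coarse-grained event captions and aligned event segments"; the actual model may realize the grounding, refinement-enhanced training, and boundary re-adjustment quite differently.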
