Best Vision Technologies Submission to ActivityNet Challenge 2018-Task: Dense-Captioning Events in Videos

This note describes the details of our solution to the dense-captioning events in videos task of ActivityNet Challenge 2018. Specifically, we solve this problem in a two-stage manner: temporal event proposal followed by sentence generation. For temporal event proposal, we directly leverage the three-stage workflow in [13, 16]. For sentence generation, we capitalize on an LSTM-based captioning framework with a temporal attention mechanism (dubbed LSTM-T). Moreover, the input visual sequence to the LSTM-based video captioning model consists of RGB and optical flow images. At inference, we adopt a late fusion scheme to combine the two LSTM-based captioning models, one per modality, for sentence generation.
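For concreteness, below is a minimal PyTorch sketch of the two ideas in the abstract: an LSTM decoder with temporal attention over the frame features, and late fusion of the per-word distributions from the RGB and optical-flow captioners at inference. All class names, layer sizes, and the fusion weight are illustrative assumptions, not the authors' implementation.

# Minimal sketch of an LSTM captioner with temporal attention (LSTM-T-style)
# and late fusion of two modality-specific models. Names and sizes are
# hypothetical, not taken from the submission.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAttention(nn.Module):
    # Soft attention over the T frame features, re-weighted at every decoding step.
    def __init__(self, feat_dim, hidden_dim, attn_dim=256):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        # feats: (B, T, feat_dim); hidden: (B, hidden_dim)
        e = self.score(torch.tanh(self.feat_proj(feats)
                                  + self.hidden_proj(hidden).unsqueeze(1)))  # (B, T, 1)
        alpha = F.softmax(e, dim=1)          # attention weights over time
        return (alpha * feats).sum(dim=1)    # (B, feat_dim) attended context

class LSTMTCaptioner(nn.Module):
    # One decoding step: embed the previous word, attend over frames, update the LSTM.
    def __init__(self, vocab_size, feat_dim=2048, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attn = TemporalAttention(feat_dim, hidden_dim)
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def step(self, prev_word, feats, h, c):
        ctx = self.attn(feats, h)
        h, c = self.lstm(torch.cat([self.embed(prev_word), ctx], dim=1), (h, c))
        return self.out(h), h, c             # vocabulary logits + new LSTM state

def late_fusion(logits_rgb, logits_flow, w=0.5):
    # Late fusion at inference: mix the per-word distributions of the RGB and
    # optical-flow captioners (equal weighting is an assumption here).
    return w * F.softmax(logits_rgb, dim=-1) + (1 - w) * F.softmax(logits_flow, dim=-1)

In this reading, the RGB and flow captioners would be run in lockstep over the same proposed event segment, with the fused distribution fed to the greedy or beam search at every decoding step.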

[1] Will Kay, et al. The Kinetics Human Action Video Dataset. arXiv, 2017.

[2] Yingwei Pan, et al. Jointly Modeling Embedding and Translation to Bridge Video and Language. CVPR, 2016.

[3] Yehao Li, et al. Jointly Localizing and Describing Events for Dense Video Captioning. CVPR, 2018.

[4] Li Yao, et al. Describing Videos by Exploiting Temporal Structure. ICCV, 2015.

[5] Ting Yao, et al. Boosting Image Captioning with Attributes. ICCV, 2017.
[6] Yingwei Pan, et al. Video Captioning with Transferred Semantic Attributes. CVPR, 2017.
[8] Steven J. Rennie, et al. Self-Critical Sequence Training for Image Captioning. CVPR, 2017.

[9] Yue Zhao, et al. Temporal Action Detection with Structured Segment Networks. ICCV, 2017.

[10] Jeff Donahue, et al. Long-Term Recurrent Convolutional Networks for Visual Recognition and Description. CVPR, 2015.

[11] Ting Yao, et al. MSR Asia MSM at ActivityNet Challenge 2017: Trimmed Action Recognition, Temporal Action Proposals and Dense-Captioning Events in Videos. 2017.

[12] Kelvin Xu, et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML, 2015.

[13] Siqi Liu, et al. Optimization of Image Description Metrics Using Policy Gradient Methods. arXiv, 2016.

[14] Zhaofan Qiu, et al. Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks. ICCV, 2017.

[15] Subhashini Venugopalan, et al. Sequence to Sequence -- Video to Text. ICCV, 2015.

[16] Ting Yao, et al. Incorporating Copying Mechanism in Image Captioning for Learning Novel Objects. CVPR, 2017.

[17] Yingwei Pan, et al. Seeing Bot. SIGIR, 2017.