Jointly Localizing and Describing Events for Dense Video Captioning
暂无分享,去创建一个
Tao Mei | Hongyang Chao | Ting Yao | Yehao Li | Yingwei Pan | Tao Mei | Ting Yao | Yingwei Pan | Yehao Li | Hongyang Chao
[1] Bernard Ghanem,et al. ActivityNet: A large-scale video benchmark for human activity understanding , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[2] Christopher Joseph Pal,et al. Describing Videos by Exploiting Temporal Structure , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).
[3] Ross B. Girshick,et al. Fast R-CNN , 2015, 1504.08083.
[4] Tao Mei,et al. MSR Asia MSM at ActivityNet Challenge 2017: Trimmed Action Recognition, Temporal Action Proposals and Dense-Captioning Events in Videos , 2017 .
[5] Dumitru Erhan,et al. Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[6] Bernard Ghanem,et al. SST: Single-Stream Temporal Action Proposals , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[7] Tao Mei,et al. Incorporating Copying Mechanism in Image Captioning for Learning Novel Objects , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[8] Tao Mei,et al. Jointly Modeling Embedding and Translation to Bridge Video and Language , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[9] Geoffrey Zweig,et al. From captions to visual concepts and back , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[10] R. Nevatia,et al. TURN TAP: Temporal Unit Regression Network for Temporal Action Proposals , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[11] Wei Liu,et al. SSD: Single Shot MultiBox Detector , 2015, ECCV.
[12] Kunio Fukunaga,et al. Natural Language Description of Human Activities from Video Images Based on Concept Hierarchy of Actions , 2002, International Journal of Computer Vision.
[13] Subhashini Venugopalan,et al. Translating Videos to Natural Language Using Deep Recurrent Neural Networks , 2014, NAACL.
[14] C. Lawrence Zitnick,et al. CIDEr: Consensus-based image description evaluation , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[15] Bernt Schiele,et al. Translating Video Content to Natural Language Descriptions , 2013, 2013 IEEE International Conference on Computer Vision.
[16] Trevor Darrell,et al. YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-Shot Recognition , 2013, 2013 IEEE International Conference on Computer Vision.
[17] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.
[18] Trevor Darrell,et al. Sequence to Sequence -- Video to Text , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).
[19] Juan Carlos Niebles,et al. Dense-Captioning Events in Videos , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[20] Pratik Rane,et al. Self-Critical Sequence Training for Image Captioning , 2018 .
[21] Limin Wang,et al. Temporal Action Detection with Structured Segment Networks , 2017, International Journal of Computer Vision.
[22] Alon Lavie,et al. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments , 2005, IEEvaluation@ACL.
[23] Tao Mei,et al. Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[24] Quoc V. Le,et al. Sequence to Sequence Learning with Neural Networks , 2014, NIPS.
[25] Jean Ponce,et al. Automatic annotation of human actions in video , 2009, 2009 IEEE 12th International Conference on Computer Vision.
[26] Wei Xu,et al. Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[27] Bernard Ghanem,et al. DAPs: Deep Action Proposals for Action Understanding , 2016, ECCV.
[28] Fei-Fei Li,et al. Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.
[29] Lorenzo Torresani,et al. Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).
[30] Bernard Ghanem,et al. Fast Temporal Activity Proposals for Efficient Detection of Human Actions in Untrimmed Videos , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[31] Trevor Darrell,et al. Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.
[32] Tao Mei,et al. Video Captioning with Transferred Semantic Attributes , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[33] Shih-Fu Chang,et al. Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[34] Xu Zhao,et al. Single Shot Temporal Action Detection , 2017, ACM Multimedia.
[35] Tao Mei,et al. Boosting Image Captioning with Attributes , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).
[36] Kaiming He,et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[37] Bernard Ghanem,et al. ActivityNet Challenge 2017 Summary , 2017, ArXiv.
[38] Salim Roukos,et al. Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.
[39] Marc'Aurelio Ranzato,et al. Sequence Level Training with Recurrent Neural Networks , 2015, ICLR.