GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval
Mike Zheng Shou | Stan Weixian Lei | Matt Feiszli | Yuxuan Wang | Difei Gao | Licheng Yu
[1] Mike Zheng Shou,et al. On Pursuit of Designing Multi-modal Transformer for Video Grounding , 2021, EMNLP.
[2] Fan Yang,et al. Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss , 2021, ArXiv.
[3] Ping Luo,et al. End-to-End Dense Video Captioning with Parallel Decoding , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[4] Andrew Zisserman,et al. Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[5] Jesús Andrés Portillo-Quintero,et al. A Straightforward Framework For Video Retrieval Using CLIP , 2021, MCPR.
[6] Weiyao Wang,et al. Generic Event Boundary Detection: A Benchmark for Event Segmentation , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[7] Yi Yang,et al. ActBERT: Learning Global-Local Video-Text Representations , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[8] Bohyung Han,et al. Local-Global Video-Text Interactions for Temporal Grounding , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[9] Runhao Zeng,et al. Dense Regression Network for Video Grounding , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[10] Esa Rahtu,et al. Multi-modal Dense Video Captioning , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).
[11] Xilin Chen,et al. UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation , 2020, ArXiv.
[12] Jiebo Luo,et al. Learning 2D Temporal Adjacent Networks for Moment Localization with Natural Language , 2019, AAAI.
[13] Iryna Gurevych,et al. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks , 2019, EMNLP.
[14] Xin Wang,et al. VaTeX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[15] Trevor Darrell,et al. Robust Change Captioning , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[16] Ramakant Nevatia,et al. MAC: Mining Activity Concepts for Language-Based Temporal Localization , 2018, 2019 IEEE Winter Conference on Applications of Computer Vision (WACV).
[17] Licheng Yu,et al. TVQA: Localized, Compositional Video Question Answering , 2018, EMNLP.
[18] Harsh Jhamtani,et al. Learning to Describe Differences Between Pairs of Similar Images , 2018, EMNLP.
[19] Tao Mei,et al. Jointly Localizing and Describing Events for Dense Video Captioning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[20] Tao Mei,et al. To Find Where You Talk: Temporal Sentence Localization in Video with Attention Based Location Regression , 2018, AAAI.
[21] Gang Li,et al. Change Detection in Heterogenous Remote Sensing Images via Homogeneous Pixel Transformation , 2018, IEEE Transactions on Image Processing.
[22] Andrew Zisserman,et al. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[23] Fabio Viola,et al. The Kinetics Human Action Video Dataset , 2017, ArXiv.
[24] Matthieu Cord,et al. MUTAN: Multimodal Tucker Fusion for Visual Question Answering , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[25] Limin Wang,et al. Temporal Segment Networks for Action Recognition in Videos , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[26] Ramakant Nevatia,et al. TALL: Temporal Activity Localization via Language Query , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[27] Juan Carlos Niebles,et al. Dense-Captioning Events in Videos , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[28] Chenliang Xu,et al. Towards Automatic Learning of Procedures From Web Instructional Videos , 2017, AAAI.
[29] Stephen Gould,et al. SPICE: Semantic Propositional Image Caption Evaluation , 2016, ECCV.
[30] Germán Ros,et al. Street-view change detection with deconvolutional networks , 2016, Autonomous Robots.
[31] Tao Mei,et al. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[32] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[33] Kaiming He,et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[34] Terrance E. Boult,et al. Towards Open World Recognition , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[35] Lorenzo Torresani,et al. Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).
[36] C. Lawrence Zitnick,et al. CIDEr: Consensus-based image description evaluation , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[37] Bernt Schiele,et al. Grounding Action Descriptions in Videos , 2013, TACL.
[38] Jeffrey M. Zacks,et al. Event perception , 2011, Scholarpedia.
[39] David L. Chen,et al. Collecting Highly Parallel Data for Paraphrase Evaluation , 2011, ACL.
[40] Chin-Yew Lin,et al. ROUGE: A Package for Automatic Evaluation of Summaries , 2004, ACL 2004.
[41] Salim Roukos,et al. Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.
[42] Tony Lindeberg,et al. Feature Detection with Automatic Scale Selection , 1998, International Journal of Computer Vision.
[43] Shiyong Cui,et al. Building Change Detection Based on Satellite Stereo Imagery and Digital Surface Models , 2014, IEEE Transactions on Geoscience and Remote Sensing.