On Semantic Similarity in Video Retrieval

Current video retrieval efforts all found their evaluation on an instance-based assumption: only a single caption is relevant to a query video, and vice versa. We demonstrate that this assumption results in performance comparisons that are often not indicative of models’ retrieval capabilities. We propose a move to semantic similarity video retrieval, where (i) multiple videos/captions can be deemed equally relevant, and their relative ranking does not affect a method’s reported performance, and (ii) retrieved videos/captions are ranked by their similarity to the query. We propose several proxies to estimate semantic similarities in large-scale retrieval datasets without additional annotations. Our analysis is performed on three commonly used video retrieval datasets (MSR-VTT, YouCook2, and EPIC-KITCHENS).
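
To make the proposed evaluation concrete, here is a minimal sketch, not the authors’ implementation: it assumes a simple bag-of-words overlap between captions as one annotation-free similarity proxy, and uses nDCG as a standard metric that satisfies criteria (i) and (ii), crediting all semantically relevant items rather than a single ground-truth pairing. All function names and the toy data are illustrative.

```python
# Sketch of a caption-based semantic similarity proxy and an nDCG-style
# evaluation (assumed setup, not the paper's exact proxies or code).
import numpy as np

def bow_similarity(caption_a, caption_b):
    """Word-set IoU between two captions: a simple, annotation-free proxy."""
    a, b = set(caption_a.lower().split()), set(caption_b.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 0.0

def ndcg(ranked_relevance, all_relevance):
    """Normalised Discounted Cumulative Gain for a single query."""
    discounts = 1.0 / np.log2(np.arange(2, len(ranked_relevance) + 2))
    dcg = float(np.sum(np.asarray(ranked_relevance) * discounts))
    idcg = float(np.sum(np.sort(all_relevance)[::-1] * discounts))
    return dcg / idcg if idcg > 0 else 0.0

# Toy usage: a query caption against a gallery of captions describing videos.
query = "a person slices an onion on a cutting board"
gallery = [
    "someone is cutting an onion",          # semantically relevant
    "a man chops vegetables in a kitchen",  # partially relevant
    "a dog runs across a field",            # irrelevant
]
relevance = np.array([bow_similarity(query, c) for c in gallery])
ranking = np.argsort(-relevance)            # stand-in for a model's retrieval order
print(ndcg(relevance[ranking], relevance))  # 1.0 here, since the order is ideal
```

Under this protocol, swapping the first two gallery items would lower the score only in proportion to their difference in proxy similarity, whereas instance-based recall would treat any caption other than the single annotated one as a complete miss.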
