The task of retrieving relevant videos with natural language queries plays a critical role in effectively indexing large-scale video data. In this report, we present a framework based on a multi-modal transformer architecture, which jointly encodes the different modalities in video and allows them to attend to each other. The transformer architecture is also leveraged to encode and model the temporal information. This novel framework allowed us to achieve the top result of the CVPR 2020 video pentathlon challenge. More details are available at http://thoth.inrialpes.fr/research/MMT.

1. The Video Pentathlon Challenge

In this report, we present the method that we implemented for the CVPR 2020 video pentathlon challenge. This challenge tackles the task of caption-to-video retrieval: given a query in the form of a caption, the goal is to retrieve the videos best described by it. The challenge considers 5 datasets: ActivityNet, DiDeMo, MSRVTT, MSVD and YouCook2. For each dataset, our task is to provide, for each caption query of the test set, a ranking of all the test video candidates such that the video associated with the caption query is ranked as high as possible.
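To make the two ideas above concrete, the sketch below illustrates (i) a transformer encoder that lets feature tokens from several video modalities attend to each other and to their temporal context, and (ii) ranking candidate videos by similarity to a caption embedding, as required by the challenge. This is not the authors' released implementation: the module names (MultiModalVideoEncoder, rank_videos), the dimensions, the number of modalities and the mean-pooling aggregation are illustrative assumptions in PyTorch.

```python
# Minimal sketch (assumed PyTorch layout, not the official MMT code) of
# joint multi-modal encoding with a transformer and caption-to-video ranking.

import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiModalVideoEncoder(nn.Module):
    """Concatenates per-modality feature sequences and lets them attend to
    each other (and across time) with a standard transformer encoder."""

    def __init__(self, d_model=512, n_heads=8, n_layers=4, n_modalities=3):
        super().__init__()
        # Learned embedding telling the transformer which modality a token comes from.
        self.modality_embed = nn.Embedding(n_modalities, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, modality_feats):
        # modality_feats: list of tensors, each (batch, seq_len_m, d_model),
        # one per modality (e.g. appearance, audio, motion features).
        tokens, mod_ids = [], []
        for m, feats in enumerate(modality_feats):
            tokens.append(feats)
            mod_ids.append(torch.full(feats.shape[:2], m, dtype=torch.long,
                                      device=feats.device))
        x = torch.cat(tokens, dim=1) + self.modality_embed(torch.cat(mod_ids, dim=1))
        x = self.encoder(x)      # cross-modal and temporal self-attention
        return x.mean(dim=1)     # simple mean pooling into one video embedding


def rank_videos(caption_emb, video_embs):
    """Return candidate-video indices sorted from best to worst match."""
    sims = F.cosine_similarity(caption_emb.unsqueeze(0), video_embs, dim=1)
    return sims.argsort(descending=True)


if __name__ == "__main__":
    enc = MultiModalVideoEncoder()
    # Toy batch: 4 candidate videos, three modalities with different lengths.
    feats = [torch.randn(4, 30, 512), torch.randn(4, 20, 512), torch.randn(4, 10, 512)]
    video_embs = enc(feats)                      # (4, 512) video embeddings
    caption_emb = torch.randn(512)               # stand-in for a text-encoder output
    print(rank_videos(caption_emb, video_embs))  # candidate indices, best first
```

In this sketch the retrieval metric is plain cosine similarity over pooled embeddings; the actual system may combine per-modality similarities differently, but the ranking step for each caption query proceeds in the same spirit.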