CVPR 2020 Video Pentathlon Challenge: Multi-modal Transformer for Video Retrieval

The task of retrieving relevant videos with natural language queries plays a critical role in effectively indexing large-scale video data. In this report, we present a framework based on a multi-modal transformer architecture, which jointly encodes the different modalities in video and allows them to attend to each other. The transformer architecture is also leveraged to encode and model the temporal information. This novel framework allowed us to achieve the top result of the CVPR 2020 video pentathlon challenge. More details are available at http://thoth.inrialpes.fr/research/MMT.

1. The Video Pentathlon Challenge

In this report, we present the method that we implemented for the CVPR 2020 video pentathlon challenge. This challenge tackles the task of caption-to-video retrieval: given a query in the form of a caption, the goal is to retrieve the videos best described by it. The challenge considers five datasets: ActivityNet, DiDeMo, MSRVTT, MSVD and YouCook2. For each dataset, our task is to provide, for each caption query of the test set, a ranking of all the test video candidates such that the video associated with the caption query is ranked as high as possible.
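The evaluation protocol described above can be sketched as follows. This is a minimal illustration, not the challenge's official evaluation code: it assumes captions and videos have already been mapped to fixed-size embeddings (the function names and the cosine-similarity scoring are illustrative assumptions), ranks all candidate videos for each caption, and computes the standard Recall@k retrieval metric.

```python
import numpy as np

def rank_videos(caption_emb, video_emb):
    """For each caption embedding, rank all candidate videos by similarity.

    caption_emb: (num_captions, dim) array, video_emb: (num_videos, dim) array.
    Returns an array of video indices sorted from most to least similar.
    """
    # L2-normalize so that dot products are cosine similarities.
    c = caption_emb / np.linalg.norm(caption_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    sims = c @ v.T  # (num_captions, num_videos) similarity matrix
    return np.argsort(-sims, axis=1)

def recall_at_k(rankings, gt_indices, k):
    """Fraction of captions whose ground-truth video appears in the top-k."""
    hits = (rankings[:, :k] == np.asarray(gt_indices)[:, None]).any(axis=1)
    return float(hits.mean())
```

With a perfect model the embedding of each caption coincides with that of its video, so Recall@1 is 1.0; in practice the retrieval model is judged by how close it gets to that ideal across all five datasets.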