Kindai University and Kobe University at TRECVID 2019 AVS Task

This paper presents our system developed for the Ad-hoc Video Search (AVS) task at TRECVID 2019. Our system is based on embedding models that map visual and textual information into a common space, in which the relevance of each shot to a topic is measured. We devise three embedding models built on two sources of training data, MS-COCO [1] and Flickr30k [2]; the image feature extractor and region detector used inside these models are pre-trained on ImageNet [3] and Visual Genome [4], respectively. We submitted the following five variants of our system:

1) F_M_C_D_kindai_kobe.19_1: an ensemble of all three embedding models. The first and second models are trained on MS-COCO and Flickr30k, respectively, and perform different coarse-grained embeddings between whole frames and a topic. The third model performs fine-grained embedding between regions in frames and words in a topic.
2) F_M_C_D_kindai_kobe.19_2: the same as F_M_C_D_kindai_kobe.19_1, except that the fine-grained embedding model normalises regional features.
3) F_M_C_D_kindai_kobe.19_3: uses only the fine-grained embedding model, without the normalisation.
4) F_M_C_D_kindai_kobe.19_4: an ensemble of only the two coarse-grained embedding models.
5) F_M_N_D_kindai_kobe.19_5: the same as F_M_C_D_kindai_kobe.19_3, except that the fine-grained embedding model uses the normalisation.

The MAPs of F_M_C_D_kindai_kobe.19_3, F_M_C_D_kindai_kobe.19_4 and F_M_N_D_kindai_kobe.19_5 are 0.080, 0.059 and 0.081, respectively, indicating that fine-grained embedding is considerably more effective than coarse-grained embedding. Both F_M_C_D_kindai_kobe.19_1 and F_M_C_D_kindai_kobe.19_2 achieve a MAP of 0.087, so ensembling coarse-grained and fine-grained embeddings yields only a small improvement over F_M_C_D_kindai_kobe.19_3 and F_M_N_D_kindai_kobe.19_5; the performance of the ensembles is therefore mainly attributable to the fine-grained model. Finally, F_M_C_D_kindai_kobe.19_1 and F_M_C_D_kindai_kobe.19_2 rank fifth among the teams participating in the fully automatic category, and our runs achieve the best MAPs for three topics in this category.
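As a rough illustration of the coarse-grained matching described above, the sketch below scores a shot against a topic by projecting a global frame feature and a topic embedding into a shared space and taking their cosine similarity, presumably along the lines of VSE++ [8]. All module names, dimensions, and the mean-pooled word vectors are our own illustrative assumptions, not the exact architecture of our submitted models.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoarseEmbedding(nn.Module):
    """Maps a global frame feature and a topic embedding into a common space.

    Hypothetical dimensions: 2048-d frame features (e.g. from ResNet [6]),
    300-d mean-pooled word vectors for the topic, 1024-d joint space.
    """

    def __init__(self, frame_dim=2048, text_dim=300, joint_dim=1024):
        super().__init__()
        self.frame_proj = nn.Linear(frame_dim, joint_dim)
        self.text_proj = nn.Linear(text_dim, joint_dim)

    def forward(self, frame_feat, topic_feat):
        # Project both modalities and L2-normalise, so that the dot
        # product equals cosine similarity in the joint space.
        v = F.normalize(self.frame_proj(frame_feat), dim=-1)
        t = F.normalize(self.text_proj(topic_feat), dim=-1)
        return (v * t).sum(dim=-1)  # one relevance score per (frame, topic) pair

model = CoarseEmbedding()
frames = torch.randn(4, 2048)   # global features of 4 keyframes
topics = torch.randn(4, 300)    # mean-pooled word vectors of 4 topics
scores = model(frames, topics)  # higher = more relevant
```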
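The fine-grained model instead matches detected regions against the words of a topic; the only difference between the normalised runs (2 and 5) and the others is whether regional features are L2-normalised first. The sketch below follows the general stacked-cross-attention recipe of SCAN [9], where each word attends over regions and the per-word similarities are pooled into a shot-level score; the dimensions, temperature, and mean pooling are illustrative assumptions rather than our exact configuration.

```python
import torch
import torch.nn.functional as F

def fine_grained_score(regions, words, normalise_regions=False, temperature=9.0):
    """Region-word matching score in the spirit of SCAN [9].

    regions: (R, D) features of R detected regions (e.g. bottom-up attention [7])
    words:   (W, D) features of the W topic words, already in the joint space
    normalise_regions: the switch distinguishing our normalised runs (2 and 5)
    """
    if normalise_regions:
        regions = F.normalize(regions, dim=-1)   # unit-length regional features
    words = F.normalize(words, dim=-1)

    sim = words @ regions.t()                    # (W, R) word-region similarities
    attn = F.softmax(temperature * sim, dim=-1)  # each word attends over regions
    attended = attn @ regions                    # (W, D) region context per word
    word_scores = F.cosine_similarity(words, attended, dim=-1)  # (W,)
    return word_scores.mean()                    # pool to a shot-level score

regions = torch.randn(36, 1024)  # 36 regions from a keyframe (hypothetical)
words = torch.randn(7, 1024)     # 7 words in the topic
score = fine_grained_score(regions, words, normalise_regions=True)
```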
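Runs 1, 2, and 4 are ensembles, so the per-model relevance scores for each shot must be fused into a single ranking. A simple and common choice is to min-max normalise each model's scores and average them; since the exact fusion rule is not detailed here, the sketch below is only one plausible instantiation.

```python
import numpy as np

def fuse_scores(score_lists):
    """Average min-max-normalised scores from several embedding models.

    score_lists: list of 1-D arrays, one per model, aligned over the same shots.
    This particular fusion rule is an assumption, not the submitted system's.
    """
    fused = np.zeros_like(score_lists[0], dtype=np.float64)
    for s in score_lists:
        s = np.asarray(s, dtype=np.float64)
        lo, hi = s.min(), s.max()
        fused += (s - lo) / (hi - lo + 1e-12)  # rescale each model to [0, 1]
    return fused / len(score_lists)

# e.g. run 1: two coarse-grained models plus the fine-grained model
coarse_coco, coarse_flickr, fine = (np.random.rand(1000) for _ in range(3))
ranking = np.argsort(-fuse_scores([coarse_coco, coarse_flickr, fine]))
```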

[1] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common Objects in Context," in Proc. ECCV, 2014.

[2] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier, "From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions," TACL, vol. 2, pp. 67-78, 2014.

[3] O. Russakovsky et al., "ImageNet Large Scale Visual Recognition Challenge," IJCV, vol. 115, no. 3, pp. 211-252, 2015.

[4] R. Krishna et al., "Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations," IJCV, vol. 123, no. 1, pp. 32-73, 2017.

[5] G. Awad et al., "TRECVID 2019: An evaluation campaign to benchmark Video Activity Detection, Video Captioning and Matching, and Video Search & Retrieval," in Proc. TRECVID, 2019.

[6] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," in Proc. CVPR, 2016.

[7] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang, "Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering," in Proc. CVPR, 2018.

[8] F. Faghri, D. J. Fleet, J. R. Kiros, and S. Fidler, "VSE++: Improving Visual-Semantic Embeddings with Hard Negatives," in Proc. BMVC, 2018.

[9] K.-H. Lee, X. Chen, G. Hua, H. Hu, and X. He, "Stacked Cross Attention for Image-Text Matching," in Proc. ECCV, 2018.

[10] K. Shirahama et al., "Kobe University, NICT and University of Siegen at TRECVID 2017 AVS Task," in Proc. TRECVID, 2017.

[11] K. Shirahama et al., "Kobe University and Kindai University at TRECVID 2018 AVS Task," in Proc. TRECVID, 2018.