This paper presents our system developed for the Ad-hoc Video Search (AVS) task in TRECVID 2019. Our system is based on embeddings that map visual and textual information into a common space, in which the relevance of each shot to a topic is measured. We devise three embedding models built on two sources of training data, MS-COCO [9] and Flickr30k [4]. The image feature extractor and the region detector used inside these models are pre-trained on ImageNet [7] and Visual Genome [3], respectively. The following five variants of our system were submitted:

1) F_M_C_D_kindai_kobe.19_1: An ensemble of three embedding models. The first and second models are trained on MS-COCO and Flickr30k, respectively, to perform different coarse-grained embeddings between frames and a topic. The third model forms a fine-grained embedding between regions in frames and words in a topic.
2) F_M_C_D_kindai_kobe.19_2: The same as F_M_C_D_kindai_kobe.19_1, except that the fine-grained embedding model normalises regional features.
3) F_M_C_D_kindai_kobe.19_3: Uses only the fine-grained embedding model, without the normalisation.
4) F_M_C_D_kindai_kobe.19_4: An ensemble of only the two coarse-grained embedding models.
5) F_M_N_D_kindai_kobe.19_5: The same as F_M_C_D_kindai_kobe.19_3, except that the fine-grained embedding model uses the normalisation.

The MAPs of F_M_C_D_kindai_kobe.19_3, F_M_C_D_kindai_kobe.19_4 and F_M_N_D_kindai_kobe.19_5 are 0.080, 0.059 and 0.081, respectively, which indicates that fine-grained embedding is much more effective than coarse-grained embedding. Given that the MAPs of both F_M_C_D_kindai_kobe.19_1 and F_M_C_D_kindai_kobe.19_2 are 0.087, the ensemble of coarse-grained and fine-grained embeddings yields only a small improvement over F_M_C_D_kindai_kobe.19_3 and F_M_N_D_kindai_kobe.19_5; the performance of the former is therefore mainly attributable to the latter. Finally, F_M_C_D_kindai_kobe.19_1 and F_M_C_D_kindai_kobe.19_2 rank fifth among the teams participating in the fully automatic category, and our runs achieve the best MAPs for three topics in this category.
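To make the relevance computation concrete, the minimal Python/NumPy sketch below illustrates the three ingredients described above: a coarse-grained cosine similarity between a frame embedding and a topic embedding in the common space, a fine-grained score in the spirit of stacked cross attention [11] that attends topic words over frame regions (with optional L2 normalisation of the regional features), and a simple average as the ensemble. The feature dimensions, the attention temperature and the averaging rule are illustrative assumptions for exposition only, not the exact configuration of the submitted runs.

import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Scale vectors to unit L2 norm along the given axis.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def coarse_score(frame_emb, topic_emb):
    # Cosine similarity between a whole-frame embedding and a topic embedding,
    # both assumed to be already projected into the common space.
    return float(np.dot(l2_normalize(frame_emb), l2_normalize(topic_emb)))

def fine_score(region_embs, word_embs, normalize_regions=True, temperature=9.0):
    # Fine-grained score in the spirit of stacked cross attention [11]:
    # each topic word attends over the frame regions, and the per-word
    # cosine similarities with the attended region vectors are averaged.
    if normalize_regions:
        region_embs = l2_normalize(region_embs, axis=1)
    word_embs = l2_normalize(word_embs, axis=1)
    sim = word_embs @ region_embs.T                  # (n_words, n_regions)
    attn = np.exp(temperature * sim)
    attn /= attn.sum(axis=1, keepdims=True)          # attention of words over regions
    attended = attn @ region_embs                    # (n_words, dim)
    per_word = np.sum(l2_normalize(attended, axis=1) * word_embs, axis=1)
    return float(per_word.mean())

def ensemble(scores):
    # Combine the per-model relevance scores of one shot by simple averaging.
    return float(np.mean(scores))

# Toy example with random vectors standing in for real frame/region/text encodings.
rng = np.random.default_rng(0)
frame = rng.normal(size=512)            # whole-frame feature in the common space
topic = rng.normal(size=512)            # encoded topic sentence
regions = rng.normal(size=(36, 512))    # e.g. 36 detected regions per frame
words = rng.normal(size=(8, 512))       # encoded topic words

relevance = ensemble([coarse_score(frame, topic),
                      fine_score(regions, words, normalize_regions=True)])
print(relevance)

In the submitted runs, the frame, region, topic and word embeddings are produced by the models trained on MS-COCO [9] and Flickr30k [4] described above; the sketch only shows how such outputs can be combined into a per-shot relevance score.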
[1] G. Awad et al., "TRECVID 2019: An Evaluation Campaign to Benchmark Video Activity Detection, Video Captioning and Matching, and Video Search & Retrieval," in Proceedings of TRECVID 2019, 2019.
[2] K. He et al., "Deep Residual Learning for Image Recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[3] R. Krishna et al., "Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations," International Journal of Computer Vision, 2017.
[4] P. Young et al., "From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions," Transactions of the Association for Computational Linguistics (TACL), 2014.
[5] P. Anderson et al., "Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[6] K. Shirahama et al., "Kobe University and Kindai University at TRECVID 2018 AVS Task," in Proceedings of TRECVID 2018, 2018.
[7] O. Russakovsky et al., "ImageNet Large Scale Visual Recognition Challenge," International Journal of Computer Vision, 2015.
[8] F. Faghri et al., "VSE++: Improving Visual-Semantic Embeddings with Hard Negatives," in Proceedings of the British Machine Vision Conference (BMVC), 2018.
[9] T.-Y. Lin et al., "Microsoft COCO: Common Objects in Context," in Proceedings of the European Conference on Computer Vision (ECCV), 2014.
[10] K. Shirahama et al., "Kobe University, NICT and University of Siegen at TRECVID 2017 AVS Task," in Proceedings of TRECVID 2017, 2017.
[11] K.-H. Lee et al., "Stacked Cross Attention for Image-Text Matching," in Proceedings of the European Conference on Computer Vision (ECCV), 2018.