Waseda participated in the TRECVID 2016 Ad-hoc Video Search (AVS) task [1]. For the AVS task, we submitted four manually assisted runs. Our approach used the following processing steps: manually creating several search keywords based on the given query phrase, calculating a score for each concept using visual features, and combining the semantic concepts to obtain the final scores. Our best run achieved a mean Average Precision (mAP) of 17.7% and was ranked the highest among all submitted runs.

1 System Description

Our method consists of three steps:

1. Manually select several search keywords based on the given query phrase (Subsection 1.1).
2. Calculate a score for each concept using visual features (Subsection 1.2).
3. Combine the semantic concepts to obtain the final scores (Subsection 1.3).

1.1 Manual search keyword selection

Given a query phrase, we manually picked out the important keywords. For example, given the query phrase “any type of fountains outdoors”, we extracted the keywords “fountain” and “outdoor”. Here, we explicitly distinguished “and” from “or”; that is, given the query phrase “one or more people walking or bicycling on a bridge during daytime”, we created the new search query “people” and (“walking” or “bicycling”) and “bridge” and “daytime”. In this case, a video does not need to include both “walking” and “bicycling”; it is sufficient if either one appears in the video (see the query-structure sketch below).

1.2 Score calculation using visual features

In our submission, we extracted visual features from pre-trained convolutional neural networks (CNNs). First, we selected at most 10 frames from each shot at regular intervals, and the corresponding images were input to the CNN to obtain feature vectors from its hidden or output layers. These (at most 10) feature vectors were then combined into a single feature vector by element-wise max-pooling (see the pooling sketch below). We used a total of nine kinds of pre-trained models to calculate concept scores, as shown in Table 1.

1. TRECVID346
We extracted 1,024-dimensional vectors from the pool5 layer of the pre-trained GoogLeNet model [6], which was trained on the ImageNet database. We then trained a support vector machine (SVM) for each concept using the annotations provided by collaborative annotation [2]. The shot score for each concept was calculated as the distance to the hyperplane in the SVM model (see the scoring sketch below).
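To make the “and”/“or” distinction of Subsection 1.1 concrete, the following is a minimal sketch of how such a manually constructed query could be represented and evaluated against per-concept shot scores. The query encoding, the example scores, and the use of min for “and” and max for “or” are illustrative assumptions only; the score combination we actually used is the one described in Subsection 1.3.

```python
# Illustrative sketch only: represents the manually built query
# "people" and ("walking" or "bicycling") and "bridge" and "daytime"
# as a nested expression over per-concept shot scores.
# The min/max fusion below is an assumption for illustration, not the
# combination method of Subsection 1.3.

from typing import Dict, List, Tuple, Union

# A query is either a concept name or ("and"/"or", [sub-queries]).
Query = Union[str, Tuple[str, List["Query"]]]

example_query: Query = (
    "and",
    ["people", ("or", ["walking", "bicycling"]), "bridge", "daytime"],
)

def evaluate(query: Query, scores: Dict[str, float]) -> float:
    """Fuse per-concept scores according to the query structure."""
    if isinstance(query, str):                  # leaf: a single concept
        return scores.get(query, 0.0)
    op, children = query
    child_scores = [evaluate(c, scores) for c in children]
    return min(child_scores) if op == "and" else max(child_scores)

# Hypothetical per-concept scores for one shot (values are made up).
shot_scores = {"people": 0.9, "walking": 0.2, "bicycling": 0.7,
               "bridge": 0.6, "daytime": 0.8}
print(evaluate(example_query, shot_scores))     # -> 0.6
```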
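The shot-level feature construction of Subsection 1.2 (regular-interval frame selection followed by element-wise max-pooling of per-frame CNN features) can be sketched as follows. The function extract_cnn_feature is a placeholder for a forward pass through a pre-trained CNN; its implementation and the array shapes are assumptions for illustration.

```python
# Minimal sketch of the shot-level feature construction in Subsection 1.2:
# pick at most 10 frames at regular intervals, extract a CNN feature for
# each, and combine them by element-wise max-pooling.

import numpy as np

def select_frames(frames: np.ndarray, max_frames: int = 10) -> np.ndarray:
    """Pick at most `max_frames` frames at regular intervals from a shot."""
    n = len(frames)
    idx = np.linspace(0, n - 1, num=min(n, max_frames)).astype(int)
    return frames[idx]

def extract_cnn_feature(frame: np.ndarray) -> np.ndarray:
    """Placeholder for a CNN forward pass (e.g., a 1024-dim pool5 vector)."""
    rng = np.random.default_rng(0)
    return rng.random(1024)  # stand-in for a real activation

def shot_feature(frames: np.ndarray, max_frames: int = 10) -> np.ndarray:
    """Element-wise max-pooling over the per-frame feature vectors."""
    feats = np.stack([extract_cnn_feature(f)
                      for f in select_frames(frames, max_frames)])
    return feats.max(axis=0)

# Example: a shot with 87 dummy frames of size 224x224x3.
dummy_shot = np.zeros((87, 224, 224, 3), dtype=np.uint8)
print(shot_feature(dummy_shot).shape)  # (1024,)
```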
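For the per-concept SVM scoring described for TRECVID346, a minimal sketch using scikit-learn's LinearSVC is shown below. The synthetic features, labels, and the specific SVM implementation are assumptions for illustration; decision_function returns the signed distance to the separating hyperplane (up to the norm of the weight vector), which serves as the shot score.

```python
# Minimal sketch of per-concept SVM scoring (Subsection 1.2, item 1):
# train a linear SVM per concept on shot-level features and score each
# shot by its signed distance to the separating hyperplane.
# The synthetic data and the use of scikit-learn's LinearSVC are
# illustrative assumptions, not the exact training setup of the paper.

import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Hypothetical training set: 200 pooled shot features (1024-dim) with
# binary labels from collaborative annotation for one concept.
X_train = rng.normal(size=(200, 1024))
y_train = rng.integers(0, 2, size=200)

svm = LinearSVC(C=1.0)
svm.fit(X_train, y_train)

# Score new shots: decision_function gives the signed distance to the
# hyperplane (up to ||w||), used here as the per-concept shot score.
X_test = rng.normal(size=(5, 1024))
concept_scores = svm.decision_function(X_test)
print(concept_scores.shape)  # (5,)
```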
References

[1] G. Awad et al., “TRECVID 2016: Evaluating Video Search, Video Event Detection, Localization, and Hyperlinking,” in Proceedings of TRECVID, 2016.
[2] S. Ayache and G. Quénot, “Video Corpus Annotation Using Active Learning,” in ECIR, 2008.
[3] B. Zhou et al., “Learning Deep Features for Scene Recognition using Places Database,” in NIPS, 2014.
[4] T.-Y. Lin et al., “Microsoft COCO: Common Objects in Context,” in ECCV, 2014.
[5] Y. Jia et al., “Caffe: Convolutional Architecture for Fast Feature Embedding,” in ACM Multimedia, 2014.
[6] C. Szegedy et al., “Going deeper with convolutions,” in CVPR, 2015.
[7] P. Mettes et al., “The ImageNet Shuffle: Reorganized Pre-training for Video Event Detection,” in ICMR, 2016.