With our submission to the 2017 TRECVID Instance Search task (Awad et al., 2017b), we focused on the use of dedicated CNN models. We limited the available video training data to only three sources: the EastEnders episode 0, the given location examples, and the given person examples. Our main workflow consists of three steps: First, we identify and crop persons from the training videos. Second, we train person and location classifiers using CNNs. Finally, we apply an ensemble strategy with prediction pooling to find matches for the given topics. In this contribution, we provide insights into our strategies and discuss the results. Additionally, we present a novel approach to interactive result annotation using HTC Vive VR headsets.

Correspondence to: Stefan Kahl, stefan.kahl@informatik.tu-chemnitz.de

1 Structured Abstract

1. Briefly, list all the different sources of training data used in the creation of your system and its components.
• We used the given master shot reference, the first episode with ID 0 (also denoted as DEV0 in this contribution) from the provided BBC EastEnders video footage, as well as the location and person video examples. Additionally, we used textual metadata crawled from the BBC website containing episode descriptions. No other external training data was used.

2. Briefly, what approach or combination of approaches did you test in each of your submitted runs?
• F_E_TUC_HSMW_1: Dedicated model ensemble for person and location classification. The results of this run are re-ranked by similarity group scores.
• F_E_TUC_HSMW_2: Dedicated model ensemble like our first run, without re-ranking.
• F_E_TUC_HSMW_3: Faster R-CNN for person detection and classification. This is our best system from 2016.
• I_E_TUC_HSMW_4: Our only interactive run, a result re-ranking of our first run. We used a novel VR environment for the annotation process.

3. What, if any, significant differences (in terms of what measures) did you find among the runs?
• In contrast to last year's submission, similarity group scores did not benefit the results.
• We managed to retrieve 20% more relevant shots compared to our 2016 system.
• Interactive re-ranking of results did not boost the overall scores as much as we expected.

4. Based on the results, can you estimate the relative contribution of each component of your system/approach to its effectiveness?
• Person and location detection can be done using small, dedicated datasets for a closed-domain environment like the EastEnders universe.
• Simple CNN architectures are easy to train on small datasets but lack high generalization performance.
• CNN ensembles perform significantly better than single models (a minimal pooling sketch follows the structured abstract).
• Interactive VR environments are suitable for interactive result re-ranking but require more time than traditional annotation tools.

5. Overall, what did you learn about runs/approaches and the research question(s) that motivated them?
• Increasing the generalization performance on small training sets is very challenging and requires extensive hyper-parameter tuning.
• Different learning strategies aside from classification tasks might improve the performance (e.g. matching images of persons for similarities).
• Ensemble strategies are strong and render some other techniques (e.g. similarity clustering) obsolete.
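To make the pooling step referenced above more concrete, the following is a minimal sketch of prediction pooling over a model ensemble. It is not the exact submission code: the model.predict interface, the per-shot frame sampling, and the simple mean over models and frames are illustrative assumptions only.

```python
# Minimal sketch of ensemble prediction pooling (illustrative, not the exact
# submission code). Each model is assumed to expose a predict() method that
# maps a batch of cropped frames to an (n_frames, n_classes) array of
# softmax scores over the person/location classes.
import numpy as np

def pool_ensemble_predictions(models, frames):
    """Average class scores over all ensemble members and all frames of a shot."""
    per_model = [model.predict(frames) for model in models]  # list of (n_frames, n_classes)
    pooled = np.mean(per_model, axis=0)                      # mean over models
    return pooled.mean(axis=0)                               # mean over frames -> (n_classes,)

def rank_shots_for_topic(models, shots, topic_index):
    """Rank shot IDs by their pooled score for a single topic class."""
    scores = {shot_id: pool_ensemble_predictions(models, frames)[topic_index]
              for shot_id, frames in shots.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

Averaging scores over both models and frames keeps the ranking robust against single mis-trained models and single noisy frames, which is the motivation behind the ensemble runs described above.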
The remainder of the paper is organized as follows: First, we provide a short workflow overview in section 2. After that, section 3 contains some insights into our training dataset. In section 4, we give a summary of our training process using artificial neural networks. Section 5 presents a novel approach for interactive evaluation on the EastEnders dataset. Finally, in section 6, we discuss the results of our submission.
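As a companion to the workflow description above, here is a minimal sketch of the first step, cropping persons from the training videos. The detector interface (detector.detect returning (x, y, w, h, label, confidence) tuples), the frame sampling step, and the confidence threshold are hypothetical stand-ins; the submission itself relies on dedicated CNN models and, in run 3, on Faster R-CNN for person detection.

```python
# Minimal sketch of the person cropping step (hypothetical detector interface,
# not the exact submission code). A Faster R-CNN style detector is assumed to
# return (x, y, w, h, label, confidence) tuples for each frame.
import cv2

def crop_persons(video_path, detector, min_confidence=0.8, frame_step=25):
    """Sample frames from a video and crop all confident person detections."""
    crops = []
    capture = cv2.VideoCapture(video_path)
    frame_idx = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if frame_idx % frame_step == 0:
            for (x, y, w, h, label, conf) in detector.detect(frame):
                if label == "person" and conf >= min_confidence:
                    crops.append(frame[y:y + h, x:x + w])
        frame_idx += 1
    capture.release()
    return crops
```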
References

[1] Mahadev Satyanarayanan et al., "OpenFace: A general-purpose face recognition library with mobile applications", 2016.
[2] Geoffrey E. Hinton et al., "Rectified Linear Units Improve Restricted Boltzmann Machines", ICML, 2010.
[3] Sergey Ioffe et al., "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift", ICML, 2015.
[4] Kaiming He et al., "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015.
[5] Jimmy Ba et al., "Adam: A Method for Stochastic Optimization", ICLR, 2014.
[6] Stefan Kahl et al., "Large-Scale Bird Sound Classification using Convolutional Neural Networks", CLEF, 2017.
[7] Georges Quénot et al., "TRECVID 2017: Evaluating Ad-hoc and Instance Video Search, Events Detection, Video Captioning and Hyperlinking", TRECVID, 2017.
[8] John Salvatier et al., "Theano: A Python framework for fast computation of mathematical expressions", arXiv, 2016.
[9] Jian Sun et al., "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification", 2015 IEEE International Conference on Computer Vision (ICCV), 2015.
[10] Hussein Hussein et al., "Technische Universität Chemnitz at TRECVID Instance Search 2015", TRECVID, 2014.
[11] Paul Over et al., "Instance search retrospective with focus on TRECVID", International Journal of Multimedia Information Retrieval, 2017.