PKU_ICST at TRECVID 2015: Instance Search Task

We participated in both types of the instance search (INS) task in TRECVID 2015: automatic search and interactive search. This paper presents our approaches and results. In this task, we mainly focused on exploring effective feature representation, feature matching, and re-ranking algorithms. This year, we also used Deep Neural Networks (DNN) to improve the results. For feature representation, we extracted two kinds of features: (1) a Bag-of-Words (BoW) feature based on Approximate K-means (AKM) and (2) a DNN feature based on Convolutional Neural Networks (CNN). For feature matching, we adopted a different ranking method for each feature: (1) for the AKM-based BoW feature, we used cosine distance to compute the similarity between each query topic and each shot; (2) for the DNN feature, we adopted multi-bag SVM (MBSVM), since it can make full use of all query examples. Moreover, we applied a keypoint matching algorithm to the top-ranked results. This step was effective yet efficient, since only the top-ranked results were considered. In the re-ranking stage, we further incorporated transcripts into our framework to exploit context information. The official evaluations showed that our team ranked 1st in both automatic search and interactive search.
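As an illustration of the BoW matching step described above, the following minimal sketch ranks shots by the cosine similarity between a query histogram and each shot histogram. It is not the authors' implementation: the function names, the toy four-word vocabulary, and the histogram values are assumptions made for the example, and a real system would use AKM-quantized visual words over a much larger vocabulary.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length BoW histograms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # An empty histogram shares no visual words with anything.
        return 0.0
    return dot / (norm_a * norm_b)


def rank_shots(query_hist, shot_hists):
    """Return (shot_id, score) pairs sorted from most to least similar."""
    scores = {
        shot_id: cosine_similarity(query_hist, hist)
        for shot_id, hist in shot_hists.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy data: visual-word counts over a 4-word vocabulary.
query = [2, 0, 1, 0]
shots = {
    "shot_a": [4, 0, 2, 0],  # same word distribution as the query
    "shot_b": [0, 3, 0, 1],  # no visual words in common with the query
    "shot_c": [1, 1, 1, 1],  # partial overlap
}

ranking = rank_shots(query, shots)
for shot_id, score in ranking:
    print(f"{shot_id}: {score:.3f}")
```

Because cosine similarity normalizes by histogram length, a shot with many more keypoints than the query is not favored simply for having larger counts.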
