UQMG @ TRECVID 2015: Instance Search

The UQMG group submitted three runs, all fully automatic, for the instance search task at TRECVID 2015 [16]. Instead of adopting a traditional retrieval approach such as the Bag-of-Visual-Words (BoVW) model, our approach consists of three major steps: video decomposition, feature extraction, and indexing. During decomposition, video segmentation is applied and visual objects are extracted. A visual object is the minimal retrieval unit, and a single video may contain thousands of objects. We then extract a visual feature for each object with a convolutional neural network (ConvNet): the high-dimensional vector output by a fully connected layer of the network. Finally, instance search is cast as finding the approximate nearest neighbors (ANN) of a given query among a large set of points in this high-dimensional feature space. Our best run achieves a mean average precision (mAP) of 0.114.
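As a rough illustration of the feature extraction and indexing steps, the sketch below computes a fully-connected-layer descriptor with Caffe's Python interface [10] and indexes the descriptors in a random-hyperplane locality-sensitive hashing (LSH) table for approximate nearest-neighbor lookup, in the spirit of [7]. The model file paths, the layer name `fc7`, and the hash parameters are illustrative assumptions, not the exact configuration used in our runs.

```python
# Minimal sketch (not our exact run configuration): extract an fc-layer ConvNet
# descriptor with pycaffe and index it with random-hyperplane LSH.
import numpy as np
import caffe  # Python interface of the Caffe framework [10]


def build_extractor(model_def, model_weights, layer='fc7'):
    """Load a ConvNet and return a function mapping an image file to an fc-layer vector."""
    # model_def / model_weights are hypothetical deploy.prototxt / .caffemodel paths.
    net = caffe.Net(model_def, model_weights, caffe.TEST)
    transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
    transformer.set_transpose('data', (2, 0, 1))      # HWC -> CHW
    transformer.set_raw_scale('data', 255)            # [0, 1] floats -> [0, 255]
    transformer.set_channel_swap('data', (2, 1, 0))   # RGB -> BGR

    def extract(image_path):
        image = caffe.io.load_image(image_path)
        transformed = transformer.preprocess('data', image)
        net.blobs['data'].reshape(1, *transformed.shape)
        net.blobs['data'].data[...] = transformed
        net.forward()
        return net.blobs[layer].data[0].copy()

    return extract


class HyperplaneLSH:
    """Single-table random-hyperplane LSH: the sign pattern of projections is the bucket key."""

    def __init__(self, dim, n_bits=32, seed=0):
        rng = np.random.RandomState(seed)
        self.planes = rng.randn(n_bits, dim)
        self.buckets = {}

    def _key(self, vec):
        return tuple((self.planes.dot(vec) > 0).tolist())

    def add(self, item_id, vec):
        self.buckets.setdefault(self._key(vec), []).append((item_id, vec))

    def query(self, vec, top_k=10):
        # Rank only the candidates that collide with the query's bucket, by cosine similarity.
        candidates = self.buckets.get(self._key(vec), [])
        scored = [(item_id,
                   float(np.dot(v, vec) /
                         (np.linalg.norm(v) * np.linalg.norm(vec) + 1e-12)))
                  for item_id, v in candidates]
        return sorted(scored, key=lambda s: -s[1])[:top_k]
```

A query image is mapped through the same network and probed against the index; in practice several hash tables (and larger candidate pools) would be maintained to trade memory for recall.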

[1] Atsuto Maki, et al. Visual Instance Retrieval with Deep Convolutional Networks, 2014, ICLR.

[2] Kunio Kashino, et al. BM25 With Exponential IDF for Instance Search, 2014, IEEE Transactions on Multimedia.

[3] Thomas Brox, et al. A Unified Video Segmentation Benchmark: Annotation, Metrics and Analysis, 2013, IEEE International Conference on Computer Vision (ICCV).

[4] Thomas Brox, et al. Spectral Graph Reduction for Efficient Image and Streaming Video Segmentation, 2014, IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5] Bernt Schiele, et al. Classifier based graph construction for video segmentation, 2015, IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6] Mei Han, et al. Efficient hierarchical graph-based video segmentation, 2010, IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7] Alexandr Andoni, et al. Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions, 2006, 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS).

[8] Shin'ichi Satoh, et al. Large vocabulary quantization for searching instances from videos, 2012, ICMR.

[9] Atsuto Maki, et al. From generic to specific deep representations for visual recognition, 2015, IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[10] Trevor Darrell, et al. Caffe: Convolutional Architecture for Fast Feature Embedding, 2014, ACM Multimedia.

[11] Kai Li, et al. Efficient k-nearest neighbor graph construction for generic similarity measures, 2011, WWW.

[12] Jie Lin, et al. DeepHash: Getting Regularization, Depth and Fine-Tuning Right, 2015, arXiv.

[13] David Stutz, et al. Neural Codes for Image Retrieval, 2015.

[14] Shin'ichi Satoh, et al. Query-Adaptive Asymmetrical Dissimilarities for Visual Object Retrieval, 2013, IEEE International Conference on Computer Vision (ICCV).

[15] Shuicheng Yan, et al. SOLD: Sub-optimal low-rank decomposition for efficient video segmentation, 2015, IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16] Georges Quénot, et al. TRECVID 2015 - An Overview of the Goals, Tasks, Data, Evaluation Mechanisms and Metrics, 2015, TRECVID.