Temporal aggregation for large-scale query-by-image video retrieval

We address the challenge of using image queries to retrieve video clips from a large database. Using binarized Fisher Vectors as global signatures, we make three contributions. First, we show that an asymmetric comparison scheme for binarized Fisher Vectors improves retrieval performance by 0.27 mean Average Precision; it exploits the fact that query images contain much less clutter than database videos. Second, we show that aggregating frame-based local features over shots achieves retrieval performance comparable to aggregating them over single frames, while reducing retrieval latency and memory requirements by more than 3×. We compare several shot aggregation strategies, and the results indicate that most perform equally well. Third, we show that aggregation over scenes, combined with shot signatures, makes retrieval one order of magnitude faster at comparable performance. Scene aggregation also outperforms the recently proposed aggregation in random groups.
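
To make the first two contributions concrete, the following is a minimal Python sketch, not the method as implemented in the paper. It rests on several assumptions that the abstract does not state: a sign-binarized residual signature over a small fixed codebook stands in for a true binarized Fisher Vector, shot aggregation is modeled as pooling the local descriptors of all frames in a shot before building one signature, and the asymmetric score is read as normalizing by the number of active codewords in the query only, so that clutter in the database item is not penalized. All function names are hypothetical.

```python
import numpy as np

def binary_signature(descriptors, centroids):
    """Sign-binarized residual signature (a stand-in for a binarized Fisher Vector).

    Returns (bits, active): bits is a (K, D) boolean array, active is a (K,)
    boolean array marking codewords that received at least one descriptor.
    """
    k, dim = centroids.shape
    # Hard-assign each descriptor to its nearest codeword.
    dists = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    nearest = dists.argmin(axis=1)
    # Accumulate residuals per codeword, then keep only their signs.
    residual_sum = np.zeros((k, dim))
    np.add.at(residual_sum, nearest, descriptors - centroids[nearest])
    active = np.bincount(nearest, minlength=k) > 0
    return residual_sum > 0, active

def shot_signature(frames, centroids):
    """Assumed shot aggregation: pool descriptors of all frames into one signature."""
    return binary_signature(np.vstack(frames), centroids)

def similarity(query, item, asymmetric=True):
    """Hamming-based correlation over codewords active in both signatures.

    Assumption: the asymmetric variant normalizes by the query's active
    codewords only; the symmetric variant normalizes by both sides.
    """
    q_bits, q_active = query
    d_bits, d_active = item
    shared = q_active & d_active
    if not shared.any():
        return 0.0
    dim = q_bits.shape[1]
    hamming = (q_bits[shared] != d_bits[shared]).sum(axis=1)
    score = float((dim - 2 * hamming).sum())
    n_q, n_d = q_active.sum(), d_active.sum()
    norm = n_q if asymmetric else np.sqrt(n_q * n_d)
    return score / (dim * norm)

# Toy data: a clean query image and a three-frame shot with extra clutter.
centroids = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
query = binary_signature(np.array([[1.0, 1.0], [0.5, 2.0]]), centroids)
frames = [np.array([[1.0, 2.0]]), np.array([[11.0, 11.0]]), np.array([[-9.0, 12.0]])]
shot = shot_signature(frames, centroids)

asym_score = similarity(query, shot, asymmetric=True)
sym_score = similarity(query, shot, asymmetric=False)
```

In this toy case the shot contains the query object plus two unrelated regions. Under the assumed symmetric normalization the clutter lowers the score, while the assumed asymmetric normalization leaves it unchanged. The shot is also stored as one signature rather than three, which is the source of the memory and latency savings the abstract describes.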
