Vectors of locally aggregated centers for compact video representation

We propose a novel vector aggregation technique for compact video representation, with application in accurate similarity detection within large video datasets. The current state of the art in visual search is formed by the vector of locally aggregated descriptors (VLAD) of Jégou et al. VLAD generates compact video representations based on scale-invariant feature transform (SIFT) vectors (extracted per frame) and local feature centers computed over a training set. With the aim of increasing robustness to visual distortions, we propose a new approach that operates at a coarser level in the feature representation. We create vectors of locally aggregated centers (VLAC) by first clustering SIFT features to obtain local feature centers (LFCs) and then encoding the latter with respect to given centers of local feature centers (CLFCs), extracted from a training set. The sums of differences between the LFCs and the CLFCs are aggregated to generate an extremely compact video description used for accurate video segment similarity detection. Experimentation using a video dataset, comprising more than 1000 minutes of content from the Open Video Project, shows that VLAC obtains substantial gains in terms of mean Average Precision (mAP) against VLAD and the hyper-pooling method of Douze et al., under the same compaction factor and the same set of distortions.
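The two-stage aggregation described above — cluster a segment's SIFT features into LFCs, then VLAD-style encode those LFCs against training-set CLFCs — can be sketched as follows. This is a minimal illustrative sketch based only on the abstract: the function name `vlac`, the choice of plain Lloyd's k-means, the number of iterations, and the final L2 normalization are assumptions, not the authors' exact implementation.

```python
import numpy as np

def vlac(sift_descriptors, n_lfc, clfc):
    """Illustrative sketch of VLAC aggregation (assumptions noted above).

    sift_descriptors: (N, D) SIFT vectors extracted from a video segment
    n_lfc: number of local feature centers (LFCs) to cluster per segment
    clfc: (K, D) centers of local feature centers from a training set
    """
    X = np.asarray(sift_descriptors, dtype=np.float64)

    # Stage 1: cluster the segment's SIFT features into LFCs.
    # A few iterations of Lloyd's k-means; a real system would use a
    # library implementation (e.g. the VLFeat toolbox cited in [5]).
    rng = np.random.default_rng(0)
    lfc = X[rng.choice(len(X), n_lfc, replace=False)].copy()
    for _ in range(10):
        assign = np.linalg.norm(X[:, None] - lfc[None], axis=2).argmin(axis=1)
        for j in range(n_lfc):
            members = X[assign == j]
            if len(members):
                lfc[j] = members.mean(axis=0)

    # Stage 2: encode each LFC against its nearest CLFC and aggregate
    # the sum of differences per CLFC (VLAD-style, but applied to
    # centers rather than raw descriptors).
    K, D = clfc.shape
    v = np.zeros((K, D))
    nearest = np.linalg.norm(lfc[:, None] - clfc[None], axis=2).argmin(axis=1)
    for j, center in zip(nearest, lfc):
        v[j] += center - clfc[j]

    # Flatten and L2-normalize, as is standard for VLAD-type descriptors.
    v = v.ravel()
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
```

Segment similarity can then be scored with an inner product (or Euclidean distance) between the resulting K×D-dimensional vectors, which is what makes the representation suitable for search over large video datasets.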

[1] Avideh Zakhor et al., Estimation of Web video multiplicity, 1999, Electronic Imaging.

[2] Cordelia Schmid et al., Event Retrieval in Large Video Collections with Circulant Temporal Encoding, 2013, IEEE Conference on Computer Vision and Pattern Recognition.

[3] David G. Lowe et al., Object recognition from local scale-invariant features, 1999, Proceedings of the Seventh IEEE International Conference on Computer Vision.

[4] Cordelia Schmid et al., Aggregating Local Image Descriptors into Compact Codes, 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[5] Andrea Vedaldi et al., VLFeat: an open and portable library of computer vision algorithms, 2010, ACM Multimedia.

[6] Chien-Li Chou et al., Near-duplicate video retrieval and localization using pattern set based dynamic programming, 2013, IEEE International Conference on Multimedia and Expo (ICME).

[7] Vasudev Bhaskaran et al., Spatiotemporal sequence matching for efficient video copy detection, 2005, IEEE Transactions on Circuits and Systems for Video Technology.

[8] Chong-Wah Ngo et al., Evaluating bag-of-visual-words representations in scene classification, 2007, MIR '07.

[9] Cordelia Schmid et al., Stable Hyper-pooling and Query Expansion for Event Detection, 2013, IEEE International Conference on Computer Vision.

[10] Jian Lu et al., Video fingerprinting for copy identification: from research to industry applications, 2009, Electronic Imaging.

[11] Nicu Sebe et al., Large-scale image and video search: Challenges, technologies, and trends, 2010, Journal of Visual Communication and Image Representation.

[12] Andrew Zisserman et al., All About VLAD, 2013, IEEE Conference on Computer Vision and Pattern Recognition.

[13] Rabab Kreidieh Ward et al., A Robust and Fast Video Copy Detection System Using Content-Based Fingerprinting, 2011, IEEE Transactions on Information Forensics and Security.

[14] Milind R. Naphade et al., Novel scheme for fast and efficient video sequence matching using compact signatures, 1999, Electronic Imaging.

[15] Ruud M. Bolle et al., Comparison of sequence matching techniques for video copy detection, 2001, IS&T/SPIE Electronic Imaging.

[16] Nasser M. Nasrabadi et al., Pattern Recognition and Machine Learning, 2006, Technometrics.