ViSiL: Fine-Grained Spatio-Temporal Video Similarity Learning

In this paper we introduce ViSiL, a Video Similarity Learning architecture that considers fine-grained Spatio-Temporal relations between pairs of videos -- such relations are typically lost in previous video retrieval approaches that embed the whole frame or even the whole video into a vector descriptor before the similarity estimation. By contrast, our Convolutional Neural Network (CNN)-based approach is trained to calculate video-to-video similarity from refined frame-to-frame similarity matrices, so as to consider both intra- and inter-frame relations. In the proposed method, pairwise frame similarity is estimated by applying Tensor Dot (TD) followed by Chamfer Similarity (CS) on regional CNN frame features - this avoids feature aggregation before the similarity calculation between frames. Subsequently, the similarity matrix between all video frames is fed to a four-layer CNN, and then summarized using Chamfer Similarity (CS) into a video-to-video similarity score -- this avoids feature aggregation before the similarity calculation between videos and captures the temporal similarity patterns between matching frame sequences. We train the proposed network using a triplet loss scheme and evaluate it on five public benchmark datasets on four different video retrieval problems where we demonstrate large improvements in comparison to the state of the art. The implementation of ViSiL is publicly available.

[1]  Shin'ichi Satoh,et al.  Temporal Matching Kernel with Explicit Feature Maps , 2015, ACM Multimedia.

[2]  Nanning Zheng,et al.  ER3: A Unified Framework for Event Retrieval, Recognition and Recounting , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Jiwen Lu,et al.  Deep Video Hashing , 2017, IEEE Transactions on Multimedia.

[4]  Cordelia Schmid,et al.  Event Retrieval in Large Video Collections with Circulant Temporal Encoding , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[5]  Yiannis Kompatsiaris,et al.  Near-Duplicate Video Retrieval by Aggregating Intermediate CNN Layers , 2017, MMM.

[6]  Matthijs Douze,et al.  LAMV: Learning to Align and Match Videos with Kernelized Temporal Layers , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[7]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[8]  Zi Huang,et al.  Near-duplicate video retrieval: Current research and future trends , 2013, CSUR.

[9]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[10]  Yang Feng,et al.  Video Re-localization , 2018, ECCV.

[11]  Chong-Wah Ngo,et al.  Practical elimination of near-duplicates from web video search , 2007, ACM Multimedia.

[12]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Chien-Li Chou,et al.  Pattern-Based Near-Duplicate Video Retrieval and Localization on Web-Scale Videos , 2015, IEEE Transactions on Multimedia.

[14]  Meng Wang,et al.  Self-Supervised Video Hashing With Hierarchical Binary Auto-Encoder , 2018, IEEE Transactions on Image Processing.

[15]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[16]  Victor S. Lempitsky,et al.  Aggregating Local Deep Features for Image Retrieval , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[17]  Hung-Khoon Tan,et al.  Scalable detection of partial near-duplicate videos by visual-temporal consistency , 2009, ACM Multimedia.

[18]  Robert C. Bolles,et al.  Parametric Correspondence and Chamfer Matching: Two New Techniques for Image Matching , 1977, IJCAI.

[19]  Yiannis Kompatsiaris,et al.  Near-Duplicate Video Retrieval with Deep Metric Learning , 2017, 2017 IEEE International Conference on Computer Vision Workshops (ICCVW).

[20]  Hervé Jégou,et al.  Negative Evidences and Co-occurences in Image Retrieval: The Benefit of PCA and Whitening , 2012, ECCV.

[21]  Yongxin Yang,et al.  Deep Multi-task Representation Learning: A Tensor Factorisation Approach , 2016, ICLR.

[22]  Meng Wang,et al.  Unsupervised t-Distributed Video Hashing and Its Deep Hashing Extension , 2017, IEEE Transactions on Image Processing.

[23]  Guangfeng Lin,et al.  IR Feature Embedded BOF Indexing Method for Near-Duplicate Video Retrieval , 2019, IEEE Transactions on Circuits and Systems for Video Technology.

[24]  Fei Wang,et al.  Million-scale near-duplicate video retrieval system , 2011, ACM Multimedia.

[25]  Jiajun Wang,et al.  Partial Copy Detection in Videos: A Benchmark and an Evaluation of Popular Methods , 2016, IEEE Transactions on Big Data.

[26]  Ronan Sicre,et al.  Particular object retrieval with integral max-pooling of CNN activations , 2015, ICLR.

[27]  Bernard Ghanem,et al.  ActivityNet: A large-scale video benchmark for human activity understanding , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Cordelia Schmid,et al.  Stable Hyper-pooling and Query Expansion for Event Detection , 2013, 2013 IEEE International Conference on Computer Vision.

[29]  Cordelia Schmid,et al.  An Image-Based Approach to Video Copy Detection With Spatio-Temporal Post-Filtering , 2010, IEEE Transactions on Multimedia.

[30]  Hao Wang,et al.  An image-based near-duplicate video retrieval and localization using improved Edit distance , 2017, Multimedia Tools and Applications.

[31]  Jiajun Wang,et al.  VCDB: A Large-Scale Database for Partial Copy Detection in Videos , 2014, ECCV.

[32]  Andrew Zisserman,et al.  Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Ioannis Patras,et al.  FIVR: Fine-Grained Incident Video Retrieval , 2018, IEEE Transactions on Multimedia.

[34]  Zi Huang,et al.  Multiple feature hashing for real-time large scale near-duplicate video retrieval , 2011, ACM Multimedia.

[35]  Xiaobo Lu,et al.  Learning spatial-temporal features for video copy detection by the combination of CNN and RNN , 2018, J. Vis. Commun. Image Represent..

[36]  Giorgos Tolias,et al.  Fine-Tuning CNN Image Retrieval with No Human Annotation , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[37]  Diyi Yang,et al.  Hierarchical Attention Networks for Document Classification , 2016, NAACL.