Temporal Context Aggregation for Video Retrieval with Contrastive Learning

Current research on Content-Based Video Retrieval calls for higher-level video representations that describe the long-range semantic dependencies of relevant incidents, events, etc. However, existing methods commonly process the frames of a video as individual images or short clips, which makes long-range semantic dependencies difficult to model. In this paper, we propose TCA (Temporal Context Aggregation for Video Retrieval), a video representation learning framework that incorporates long-range temporal information between frame-level features using the self-attention mechanism. To train it on video retrieval datasets, we propose a supervised contrastive learning method that performs automatic hard negative mining and uses a memory bank mechanism to increase the capacity of negative samples. Extensive experiments are conducted on multiple video retrieval benchmarks, including CC_WEB_VIDEO, FIVR-200K, and EVVE. The proposed method shows a significant performance advantage (~17% mAP on FIVR-200K) over state-of-the-art methods with video-level features, and delivers competitive results with 22x faster inference compared with methods using frame-level features.
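The two ingredients named above, self-attention over frame-level features and a contrastive loss whose negatives come from a memory bank, can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The identity query/key/value projections, the mean pooling, the temperature of 0.07, the bank size, and all function names are assumptions made for illustration only.

```python
import math
import random
from collections import deque


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(frames):
    """Single-head scaled dot-product self-attention.

    Assumption: identity projections (no learned weights), so each output
    frame is a similarity-weighted average of all frames in the video.
    """
    d = len(frames[0])
    out = []
    for q in frames:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in frames])
        out.append([sum(w * v[j] for w, v in zip(weights, frames))
                    for j in range(d)])
    return out


def video_embedding(frames):
    """Mean-pool the context-aware frames into one unit-length video vector."""
    ctx = self_attention(frames)
    d = len(ctx[0])
    mean = [sum(f[j] for f in ctx) / len(ctx) for j in range(d)]
    return l2_normalize(mean)


def contrastive_loss(anchor, positive, negatives, temperature=0.07):
    """InfoNCE-style loss: -log softmax of the positive among all candidates.

    The softmax gives negatives that are more similar to the anchor a larger
    share of the gradient, which is one way hard negatives get emphasized.
    """
    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, n) / temperature for n in negatives]
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]


def random_video(n_frames=5, dim=8):
    """Hypothetical stand-in for frame-level CNN features."""
    return [[random.gauss(0.0, 1.0) for _ in range(dim)]
            for _ in range(n_frames)]


random.seed(0)

# Memory bank: a fixed-size queue of embeddings from earlier batches,
# reused as extra negatives. The oldest entries are dropped automatically.
bank = deque(maxlen=4)
for _ in range(6):
    bank.append(video_embedding(random_video()))

anchor_frames = random_video()
# A slightly perturbed copy plays the role of a relevant (positive) video.
positive_frames = [[x + 0.01 * random.gauss(0.0, 1.0) for x in f]
                   for f in anchor_frames]

anchor = video_embedding(anchor_frames)
positive = video_embedding(positive_frames)

loss_without_bank = contrastive_loss(anchor, positive, [])
loss_with_bank = contrastive_loss(anchor, positive, list(bank))
print(loss_without_bank, loss_with_bank)
```

With no negatives the loss is zero by construction, and every negative drawn from the bank can only raise it. A larger bank therefore gives the model more to discriminate against without requiring a larger batch.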
