Context Encoding for Video Retrieval with Contrastive Learning

Content-based video retrieval plays an important role in areas such as video recommendation and copyright protection. Existing video retrieval methods mainly extract frame-level features independently; they therefore lack an efficient way to aggregate features across frames and struggle with poor-quality frames, such as those affected by motion blur or defocus. In this paper, we propose CECL (Context Encoding for video retrieval with Contrastive Learning), which combines a video representation learning framework that aggregates the context information of frame-level descriptors with a supervised contrastive learning method that performs automatic hard negative mining and uses a memory bank mechanism to increase the capacity of negative samples. Extensive experiments are conducted on multiple video retrieval tasks, such as FIVR, CC_WEB_VIDEO and EVVE. The proposed method shows a significant performance advantage (~17% mAP on FIVR-200K) over state-of-the-art methods with video-level features, and delivers competitive results at much lower computational cost when compared with methods based on frame-level features.
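To make the contrastive component described above more concrete, the following is a minimal sketch of a softmax-based contrastive loss combined with a FIFO memory bank of negatives. It is not the authors' implementation: the names `MemoryBank` and `info_nce_loss`, the temperature value, and the use of an InfoNCE-style loss are assumptions chosen for illustration. In a loss of this form, negatives that are more similar to the anchor receive larger softmax weight, which is one common way to obtain implicit hard negative mining; whether CECL uses exactly this mechanism is not stated in the abstract.

```python
import numpy as np
from collections import deque


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


class MemoryBank:
    """Hypothetical FIFO store of past video embeddings used as extra negatives."""

    def __init__(self, dim, capacity):
        self.dim = dim
        # deque with maxlen drops the oldest entries once capacity is reached
        self.queue = deque(maxlen=capacity)

    def enqueue(self, embeddings):
        for e in embeddings:
            self.queue.append(np.asarray(e, dtype=np.float64))

    def as_array(self):
        if not self.queue:
            return np.zeros((0, self.dim))
        return np.stack(list(self.queue))


def info_nce_loss(anchors, positives, negatives, temperature=0.07):
    """InfoNCE-style loss (an assumed stand-in for the paper's objective).

    anchors, positives: (B, D) arrays, row i of each forms a positive pair.
    negatives: (K, D) array shared by all anchors, e.g. from a memory bank.
    """
    a = l2_normalize(anchors)
    p = l2_normalize(positives)
    n = l2_normalize(negatives)

    pos_logit = np.sum(a * p, axis=1, keepdims=True) / temperature  # (B, 1)
    neg_logits = (a @ n.T) / temperature                            # (B, K)
    logits = np.concatenate([pos_logit, neg_logits], axis=1)

    # Subtract the row-wise maximum for numerical stability
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits[:, 0] - np.log(np.exp(logits).sum(axis=1))
    return float(-log_prob.mean())
```

In a training loop, one would typically compute the loss for the current batch against `bank.as_array()` and then call `bank.enqueue(...)` with the batch embeddings, so that the number of negatives is decoupled from the batch size.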
