DnS: Distill-and-Select for Efficient and Accurate Video Indexing and Retrieval

Giorgos Kordopatis-Zilos
Information Technologies Institute, Centre for Research and Technology Hellas, Thessaloniki, Greece, and Queen Mary University of London, Mile End Road, E1 4NS London
E-mail: georgekordopatis@iti.gr

Christos Tzelepis · Ioannis Patras
Queen Mary University of London, Mile End Road, E1 4NS London
E-mail: {c.tzelepis, i.patras}@qmul.ac.uk

Symeon Papadopoulos · Ioannis Kompatsiaris
Information Technologies Institute, Centre for Research and Technology Hellas, Thessaloniki, Greece
E-mail: {papadop, ikom}@iti.gr

In this paper, we address the problem of high-performance and computationally efficient content-based video retrieval in large-scale datasets. Current methods typically propose either: (i) fine-grained approaches employing spatio-temporal representations and similarity calculations, achieving high performance at a high computational cost, or (ii) coarse-grained approaches representing/indexing videos as global vectors, where the spatio-temporal structure is lost, providing low performance but also having low computational cost. In this work, we propose a Knowledge Distillation framework, which we call Distill-and-Select (DnS), that, starting from a well-performing fine-grained Teacher Network, learns: a) Student Networks at different retrieval performance and computational efficiency trade-offs and b) a Selection Network that at test time rapidly directs samples to the appropriate student so as to maintain both high retrieval performance and high computational efficiency. We train several students with different architectures and arrive at different trade-offs of performance and efficiency, i.e., speed and storage requirements, including fine-grained students that store and index videos using binary representations. Importantly, the proposed scheme allows Knowledge Distillation in large, unlabelled datasets, which leads to good students.
We evaluate DnS on five public datasets on three different video retrieval tasks and demonstrate a) that our students achieve state-of-the-art performance in several cases and b) that our DnS framework provides an excellent trade-off between retrieval performance, computational speed, and storage space. In specific configurations, our method achieves mAP similar to the teacher's while being 20 times faster and requiring 240 times less storage space. Our collected dataset and implementation are publicly available: https://github.com/mever-team/distill-and-select.
