Central Similarity Quantization for Efficient Image and Video Retrieval

Existing data-dependent hashing methods usually learn hash functions from pairwise or triplet data relationships, which only capture the data similarity locally, and often suffer from low learning efficiency and low collision rate. In this work, we propose a new \emph{global} similarity metric, termed as \emph{central similarity}, with which the hash codes of similar data pairs are encouraged to approach a common center and those for dissimilar pairs to converge to different centers, to improve hash learning efficiency and retrieval accuracy. We principally formulate the computation of the proposed central similarity metric by introducing a new concept, i.e., \emph{hash center} that refers to a set of data points scattered in the Hamming space with a sufficient mutual distance between each other. We then provide an efficient method to construct well separated hash centers by leveraging the Hadamard matrix and Bernoulli distributions. Finally, we propose the Central Similarity Quantization (CSQ) that optimizes the central similarity between data points w.r.t.\ their hash centers instead of optimizing the local similarity. CSQ is generic and applicable to both image and video hashing scenarios. Extensive experiments on large-scale image and video retrieval tasks demonstrate that CSQ can generate cohesive hash codes for similar data pairs and dispersed hash codes for dissimilar pairs, achieving a noticeable boost in retrieval performance, i.e. 3\%-20\% in mAP over the previous state-of-the-arts.

[1]  Kien A. Hua,et al.  DLSTM approach to video modeling with hashing for large-scale video retrieval , 2016, 2016 23rd International Conference on Pattern Recognition (ICPR).

[2]  Wei Liu,et al.  Multimedia Hashing and Networking , 2016, IEEE MultiMedia.

[3]  Wei Liu,et al.  Learning Binary Codes for Maximum Inner Product Search , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[4]  Chao Ma,et al.  Supervised Recurrent Hashing for Large Scale Video Retrieval , 2016, ACM Multimedia.

[5]  Wei Liu,et al.  Discrete Graph Hashing , 2014, NIPS.

[6]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[7]  Aapo Hyvärinen,et al.  Natural Image Statistics - A Probabilistic Approach to Early Computational Vision , 2009, Computational Imaging and Vision.

[8]  Wei Liu,et al.  Supervised Discrete Hashing , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Luc Van Gool,et al.  Temporal Segment Networks: Towards Good Practices for Deep Action Recognition , 2016, ECCV.

[10]  David J. Fleet,et al.  Hamming Distance Metric Learning , 2012, NIPS.

[11]  Yu Qiao,et al.  A Discriminative Feature Learning Approach for Deep Face Recognition , 2016, ECCV.

[12]  Wei Liu,et al.  Learning to Hash for Indexing Big Data—A Survey , 2015, Proceedings of the IEEE.

[13]  Junzhou Huang,et al.  Sub-Selective Quantization for Learning Binary Codes in Large-Scale Image Search , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[14]  Wu-Jun Li,et al.  Feature Learning Based Deep Supervised Hashing with Pairwise Labels , 2015, IJCAI.

[15]  Ping Li,et al.  Cycle-SUM: Cycle-consistent Adversarial LSTM Networks for Unsupervised Video Summarization , 2019, AAAI.

[16]  Trevor Darrell,et al.  Learning to Hash with Binary Reconstructive Embeddings , 2009, NIPS.

[17]  John Erik Hershey,et al.  A Hadamard Matrix , 1997 .

[18]  Svetlana Lazebnik,et al.  Iterative quantization: A procrustean approach to learning binary codes , 2011, CVPR 2011.

[19]  Rahul Sukthankar,et al.  Rethinking the Faster R-CNN Architecture for Temporal Action Localization , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[20]  Heng Tao Shen,et al.  Hashing for Similarity Search: A Survey , 2014, ArXiv.

[21]  Thomas Serre,et al.  HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.

[22]  Jiwen Lu,et al.  Deep Video Hashing , 2017, IEEE Transactions on Multimedia.

[23]  Wei Liu,et al.  Self-Supervised Adversarial Hashing Networks for Cross-Modal Retrieval , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[24]  Rongrong Ji,et al.  Supervised hashing with kernels , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[25]  Shiguang Shan,et al.  Deep Supervised Hashing for Fast Image Retrieval , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Trevor Darrell,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Wei Liu,et al.  Learning Hash Codes with Listwise Supervision , 2013, 2013 IEEE International Conference on Computer Vision.

[28]  Rongrong Ji,et al.  Top Rank Supervised Binary Coding for Visual Search , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[29]  Wei Liu,et al.  Hashing with Graphs , 2011, ICML.

[30]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[31]  Manohar Paluri,et al.  Metric Learning with Adaptive Density Discrimination , 2015, ICLR.

[32]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[33]  Hanjiang Lai,et al.  Simultaneous feature learning and hash coding with deep neural networks , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Philip S. Yu,et al.  HashNet: Deep Learning to Hash by Continuation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[35]  Yann LeCun,et al.  The mnist database of handwritten digits , 2005 .

[36]  Francis Eng Hock Tay,et al.  Unsupervised Video Summarization With Cycle-Consistent Adversarial LSTM Networks , 2020, IEEE Transactions on Multimedia.

[37]  Cordelia Schmid,et al.  Long-Term Temporal Convolutions for Action Recognition , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[38]  Sharath Pankanti,et al.  RepMet: Representative-Based Metric Learning for Classification and Few-Shot Object Detection , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Jianmin Wang,et al.  Deep Cauchy Hashing for Hamming Space Retrieval , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[40]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Piotr Indyk,et al.  Similarity Search in High Dimensions via Hashing , 1999, VLDB.

[42]  Hanjiang Lai,et al.  Supervised Hashing for Image Retrieval via Image Representation Learning , 2014, AAAI.

[43]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[44]  Tat-Seng Chua,et al.  NUS-WIDE: a real-world web image database from National University of Singapore , 2009, CIVR '09.

[45]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[46]  Antonio Torralba,et al.  Spectral Hashing , 2008, NIPS.

[47]  Shuicheng Yan,et al.  Multi-Fiber Networks for Video Recognition , 2018, ECCV.

[48]  Jianmin Wang,et al.  Deep Hashing Network for Efficient Similarity Retrieval , 2016, AAAI.

[49]  Dacheng Tao,et al.  DistillHash: Unsupervised Deep Hashing by Distilling Data Pairs , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[50]  Ling Shao,et al.  Fast action retrieval from videos via feature disaggregation , 2017, Computer Vision and Image Understanding.

[51]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[52]  De Xie,et al.  Cross-Modal Learning with Adversarial Samples , 2019, NeurIPS.

[53]  Raymond Chiong,et al.  Why Is Optimization Difficult? , 2009, Nature-Inspired Algorithms for Optimisation.

[54]  Luca Antiga,et al.  Automatic differentiation in PyTorch , 2017 .