Learning to Hash for Indexing Big Data: A Survey

This paper provides readers with a systematic understanding of insights, pros, and cons of the emerging indexing and search methods for Big Data.

The explosive growth of Big Data has recently attracted much attention to the design of efficient indexing and search methods. In many critical applications, such as large-scale search and pattern matching, finding the nearest neighbors to a query is a fundamental research problem. However, the straightforward solution of exhaustive comparison is infeasible because of its prohibitive computational complexity and memory requirements. In response, approximate nearest neighbor (ANN) search based on hashing techniques has become popular owing to its promising performance in both efficiency and accuracy. Prior randomized hashing methods, e.g., locality-sensitive hashing (LSH), explore data-independent hash functions with random projections or permutations. Although these methods have elegant theoretical guarantees on search quality in certain metric spaces, the performance of randomized hashing has been shown to be insufficient in many real-world applications. As a remedy, new approaches that incorporate data-driven learning methods into the development of advanced hash functions have emerged. Such learning-to-hash methods exploit information such as data distributions or class labels when optimizing the hash codes or functions. Importantly, the learned hash codes preserve, in the hash code space, the proximity of neighboring data in the original feature space. The goal of this paper is to provide readers with a systematic understanding of insights, pros, and cons of the emerging techniques. We provide a comprehensive survey of the learning-to-hash framework and representative techniques of various types, including unsupervised, semisupervised, and supervised. In addition, we summarize recent hashing approaches that utilize deep learning models. Finally, we discuss future directions and trends of research in this area.
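To make the idea of a data-independent hash function with random projections concrete, the following is a minimal sketch of one common LSH variant, sign random projections for angular similarity. It is an illustration under stated assumptions, not a method defined in this paper: the function names (`make_hyperplanes`, `hash_code`, `hamming`), the 16-bit code length, and the toy vectors are all hypothetical.

```python
import random


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random Gaussian directions (assumed helper).

    The directions are sampled without looking at any data, which is
    what makes this kind of hashing data-independent.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def hash_code(x, hyperplanes):
    """Return one bit per hyperplane: 1 if the projection of x is >= 0, else 0."""
    return tuple(
        1 if sum(w_i * x_i for w_i, x_i in zip(w, x)) >= 0 else 0
        for w in hyperplanes
    )


def hamming(a, b):
    """Count the positions where two equal-length codes differ."""
    return sum(p != q for p, q in zip(a, b))


# Toy vectors (assumed for illustration only).
planes = make_hyperplanes(num_bits=16, dim=3)
query = [1.0, 0.5, -0.2]
scaled = [2.0, 1.0, -0.4]      # same direction as query
opposite = [-1.0, -0.5, 0.2]   # opposite direction to query

print(hamming(hash_code(query, planes), hash_code(scaled, planes)))
print(hamming(hash_code(query, planes), hash_code(opposite, planes)))
```

Vectors pointing in the same direction receive identical codes, while vectors pointing in opposite directions differ in every bit, so Hamming distance between codes tracks the angle between the original vectors. Learning-to-hash methods keep this code-and-compare pipeline but fit the projections to the data instead of drawing them at random.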
