Learning to Hash for Indexing Big Data—A Survey

The explosive growth of Big Data has recently drawn considerable attention to the design of efficient indexing and search methods. In many critical applications, such as large-scale search and pattern matching, finding the nearest neighbors of a query is a fundamental research problem. However, the straightforward solution of exhaustive comparison is infeasible because of its prohibitive computational complexity and memory requirements. In response, approximate nearest neighbor (ANN) search based on hashing techniques has become popular owing to its promising performance in both efficiency and accuracy. Earlier randomized hashing methods, e.g., locality-sensitive hashing (LSH), explore data-independent hash functions built from random projections or permutations. Although these methods have elegant theoretical guarantees on search quality in certain metric spaces, their performance has been shown to be insufficient in many real-world applications. As a remedy, new approaches that incorporate data-driven learning into the development of advanced hash functions have emerged. Such learning-to-hash methods exploit information such as data distributions or class labels when optimizing the hash codes or functions. Importantly, the learned hash codes are able to preserve, in the hash code space, the proximity of neighboring data in the original feature space. The goal of this paper is to provide readers with a systematic understanding of the insights, pros, and cons of these emerging techniques. We provide a comprehensive survey of the learning-to-hash framework and representative techniques of various types, including unsupervised, semisupervised, and supervised methods. In addition, we summarize recent hashing approaches that utilize deep learning models. Finally, we discuss future directions and trends of research in this area.
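To make the data-independent baseline mentioned above concrete, the following is a minimal sketch of random-projection hashing in the spirit of LSH: each bit records on which side of a randomly drawn hyperplane a vector falls, and codes are compared by Hamming distance. The function names (`make_hyperplanes`, `hash_code`, `hamming`), the code length, and the toy vectors are illustrative assumptions, not taken from the survey.

```python
import random


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random Gaussian directions (data-independent)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def hash_code(x, hyperplanes):
    """One bit per hyperplane: 1 if x lies on its non-negative side, else 0."""
    return tuple(
        1 if sum(w_i * x_i for w_i, x_i in zip(w, x)) >= 0 else 0
        for w in hyperplanes
    )


def hamming(a, b):
    """Number of bit positions in which two codes differ."""
    return sum(u != v for u, v in zip(a, b))


# Hypothetical toy data: a query, a nearby vector, and the query's opposite.
planes = make_hyperplanes(num_bits=16, dim=3)
query = [1.0, 0.9, 0.1]
near = [0.9, 1.0, 0.2]
opposite = [-v for v in query]

code_query = hash_code(query, planes)
code_near = hash_code(near, planes)
code_opposite = hash_code(opposite, planes)

print("query vs near:    ", hamming(code_query, code_near))
print("query vs opposite:", hamming(code_query, code_opposite))
```

Because the hyperplanes are drawn without looking at the data, vectors separated by a small angle tend to share most bits, while the opposite vector flips every bit. Learning-to-hash methods replace the random draw in `make_hyperplanes` with projections optimized on the data or its labels.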
