Two-Stage Hashing for Fast Document Retrieval

This work achieves sublinear-time Nearest Neighbor Search (NNS) in massive-scale document collections. The primary contribution is a two-stage unsupervised hashing framework that integrates two state-of-the-art hashing algorithms, Locality Sensitive Hashing (LSH) and Iterative Quantization (ITQ). LSH prunes the neighbor candidates, while ITQ provides an efficient and effective reranking of the candidate pool captured by LSH. Furthermore, the proposed framework exploits both term and topic similarity among documents, leading to precise document retrieval. Experimental results show that our hashing-based document retrieval approach closely approximates the conventional Information Retrieval (IR) method in retrieving semantically similar documents, while achieving a speedup of over one order of magnitude in query time.
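The two-stage idea can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes sign-random-projection LSH (in the style of reference [3]) for the pruning stage and a basic ITQ (PCA followed by an orthogonal Procrustes rotation, as in reference [5]) for the reranking stage. All function names, the Hamming `radius`, and the bit lengths are hypothetical choices made for illustration. The abstract mentions both term and topic similarity, so the two stages may be fed different document representations; the sketch leaves that to the caller.

```python
import numpy as np

def lsh_bits(X, planes):
    """Stage-1 codes: one bit per random hyperplane (sign random projection)."""
    return X @ planes > 0

def fit_itq(X, n_bits, n_iter=30, seed=0):
    """Learn an ITQ projection: PCA to n_bits dimensions, then a rotation
    that reduces the quantization error between Z @ R and its sign pattern."""
    rng = np.random.default_rng(seed)
    mean = X.mean(axis=0)
    Xc = X - mean
    # PCA directions are the leading right singular vectors of the centered data.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:n_bits].T                       # shape (d, n_bits)
    Z = Xc @ W
    # Start from a random orthogonal matrix.
    R, _ = np.linalg.qr(rng.standard_normal((n_bits, n_bits)))
    for _ in range(n_iter):
        B = np.where(Z @ R >= 0, 1.0, -1.0)  # fix R, update binary codes
        U, _, Vt2 = np.linalg.svd(Z.T @ B)   # fix B, solve orthogonal Procrustes
        R = U @ Vt2
    return mean, W @ R

def itq_bits(X, mean, P):
    """Stage-2 codes from the learned ITQ projection."""
    return (X - mean) @ P > 0

def two_stage_search(q_lsh, q_itq, db_lsh, db_itq, radius=2, k=10):
    """Prune with LSH codes, then rerank the survivors by ITQ Hamming distance.

    Assumption: a linear scan over LSH codes stands in for the hash-table
    lookup a real sublinear-time system would use.
    """
    lsh_dist = (db_lsh != q_lsh).sum(axis=1)
    candidates = np.flatnonzero(lsh_dist <= radius)
    itq_dist = (db_itq[candidates] != q_itq).sum(axis=1)
    order = np.argsort(itq_dist, kind="stable")[:k]
    return candidates[order], itq_dist[order]
```

A hypothetical usage, with random vectors standing in for document features: build `planes = rng.standard_normal((d, 12))`, call `fit_itq(X, n_bits=16)`, encode the collection with `lsh_bits` and `itq_bits`, and pass a query's two codes to `two_stage_search`. A larger `radius` enlarges the candidate pool at the cost of more reranking work.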
