Rank hash similarity for fast similarity search

This paper is concerned with similarity search at large scale: efficiently and accurately finding the data points that are similar to a query data point. An efficient way to accelerate similarity search is to learn hash functions. Existing approaches for learning hash functions aim to obtain small Hamming distances for similar pairs. However, these methods ignore the ranking order of the Hamming distances, which leads to poor accuracy in finding similar items for a query data point. In this paper, an algorithm is proposed, referred to as top-k RHS (Rank Hash Similarity), in which a ranking loss function is designed for learning a hash function. The hash function is assumed to be made up of l binary classifiers, so learning the hash function can be formulated as the task of learning l binary classifiers. The algorithm runs for l rounds and learns one binary classifier in each round. The proposed method has the same order of computational complexity as the existing approaches, yet experimental results on three text datasets show that it obtains higher accuracy than the baselines.
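To make the structure described above concrete, the following minimal sketch shows how l binary classifiers can be concatenated into an l-bit hash code and how database items can then be ranked by Hamming distance to a query. It is only an illustration under stated assumptions: the linear-threshold classifier form, the names `hash_code`, `hamming`, and `rank_by_hamming`, and the hand-picked `WEIGHTS` and `BIASES` are hypothetical, and the classifiers are not learned with the ranking loss of the proposed method.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def hash_code(x: Vector, weights: Sequence[Vector], biases: Sequence[float]) -> Tuple[int, ...]:
    """Concatenate the outputs of l binary classifiers into an l-bit code.

    Assumption: each classifier is a linear threshold unit; the abstract
    does not specify the classifier form.
    """
    bits = []
    for w, b in zip(weights, biases):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        bits.append(1 if score > 0 else 0)
    return tuple(bits)


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of bit positions in which two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))


def rank_by_hamming(query_code: Sequence[int], db_codes: Sequence[Sequence[int]]) -> List[int]:
    """Database indices sorted by increasing Hamming distance (ties by index)."""
    return sorted(range(len(db_codes)), key=lambda i: (hamming(query_code, db_codes[i]), i))


# Hypothetical, hand-picked l = 3 classifiers over 2-dimensional inputs.
# In the proposed method these would be learned, one per round.
WEIGHTS = [(1.0, 0.0), (0.0, 1.0), (1.0, -1.0)]
BIASES = [0.0, 0.0, 0.0]

database = [(1.0, 2.0), (-1.0, -2.0), (2.0, 1.0)]
db_codes = [hash_code(x, WEIGHTS, BIASES) for x in database]
query_code = hash_code((1.5, 2.5), WEIGHTS, BIASES)
ranking = rank_by_hamming(query_code, db_codes)
```

In this toy setup, similar items should receive codes that are close in Hamming distance to the query code, and the order of those distances determines which items are returned first. That order is the quantity a ranking loss targets, as opposed to a loss that only drives the distances of similar pairs toward small values.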
