BitHash: An efficient bitwise Locality Sensitive Hashing method with applications

Abstract: Locality Sensitive Hashing has been applied to detecting near-duplicate images, videos and web documents. In this paper we present a bitwise Locality Sensitive Hashing method, BitHash, which uses only one bit per hash value. As a result, the storage space for hash values is significantly reduced and the estimator can be computed much faster. The method provides an unbiased estimate of pairwise Jaccard similarity, and the estimator is a simple linear function of Hamming distance. We rigorously analyze the variance of One-Bit Min-Hash (BitHash), showing that for high Jaccard similarity BitHash may provide accurate estimation, and that as the pairwise Jaccard similarity increases, the variance ratio of BitHash over the original min-hash decreases. Furthermore, BitHash compresses each data sample into a compact binary hash code while preserving the pairwise similarity of the original data. The binary code can serve as a compressed and informative representation that replaces the original data in subsequent processing. For example, it can be naturally integrated with a classifier such as an SVM. We apply BitHash to two typical applications: near-duplicate image detection and sentiment analysis. Experiments on a real user's photo collection and a popular sentiment analysis data set show that the classification accuracy of the proposed method on both applications can approach that of the state-of-the-art method, while BitHash requires significantly less storage space.
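The idea of a one-bit min-hash with a Hamming-distance estimator can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes that each min-hash value is mapped to a single bit by an independent random binary function, so that two sets with Jaccard similarity J agree on a given bit with probability (1 + J) / 2; the abstract does not spell out this construction. The helper names (`_hash64`, `one_bit_minhash`, `estimate_jaccard`) and the use of salted BLAKE2b as a stand-in for random hash functions are choices made for this example only.

```python
import hashlib


def _hash64(seed: int, item: str) -> int:
    """Salted 64-bit hash, used here as a stand-in for a random hash function."""
    digest = hashlib.blake2b(
        item.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def one_bit_minhash(items, num_hashes=512):
    """Return a list of num_hashes bits summarizing the set `items`."""
    bits = []
    for k in range(num_hashes):
        # Ordinary min-hash: smallest hash value over the set under hash k.
        min_value = min(_hash64(2 * k, x) for x in items)
        # Assumed one-bit step: hash the min-hash value again and keep one bit.
        bits.append(_hash64(2 * k + 1, str(min_value)) & 1)
    return bits


def hamming_distance(a, b):
    """Number of positions where the two bit lists differ."""
    return sum(x != y for x, y in zip(a, b))


def estimate_jaccard(a, b):
    """Linear estimator: P(bits differ) = (1 - J) / 2, so J = 1 - 2 * d_H / K."""
    return 1.0 - 2.0 * hamming_distance(a, b) / len(a)


if __name__ == "__main__":
    set_a = {f"w{i}" for i in range(0, 300)}
    set_b = {f"w{i}" for i in range(100, 400)}  # true Jaccard = 200 / 400 = 0.5
    code_a = one_bit_minhash(set_a)
    code_b = one_bit_minhash(set_b)
    print("estimated Jaccard:", estimate_jaccard(code_a, code_b))
```

Under the same assumption, each bit comparison is a Bernoulli trial with success probability (1 + J) / 2, so the estimate from K bits has variance (1 - J^2) / K, against J(1 - J) / K for K full min-hash values. The ratio (1 + J) / J shrinks as J grows, which is consistent with the behaviour the abstract describes for high Jaccard similarity.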
