b-Bit minwise hashing

This paper establishes the theoretical framework of b-bit minwise hashing. The original minwise hashing method has become a standard technique for estimating set similarity (e.g., resemblance), with applications in information retrieval, data management, and computational advertising, among others. By storing only b bits of each hashed value (e.g., b=1 or 2), we gain substantial savings in storage space. We prove the basic theoretical results and provide an unbiased estimator of the resemblance for any b. We demonstrate that, even in the least favorable scenario, using b=1 may reduce the storage space by a factor of at least 21.3 (or 10.7) compared to b=64 (or b=32), if one is interested in resemblance >0.5.
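To make the idea concrete, here is a minimal Python sketch, not the paper's reference implementation. It makes several assumptions: salted 64-bit BLAKE2b hashes stand in for random permutations; the stored b bits are the lowest b bits of each minimum; and the sets are tiny relative to the 64-bit hash space, so two different minima agree on their lowest b bits with probability about 1/2^b. Under that sparse-data simplification the collision probability is roughly P = 1/2^b + (1 - 1/2^b) R, which the estimator inverts. The function names (`min_hashes`, `lowest_bits`, `estimate_resemblance`, `storage_improvement`) are hypothetical and chosen for illustration only.

```python
import hashlib


def _hash64(seed: int, x: int) -> int:
    """64-bit hash of a non-negative integer x under a given seed (salted BLAKE2b)."""
    digest = hashlib.blake2b(
        x.to_bytes(8, "little"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def min_hashes(items, k: int):
    """Return k minwise hash values (full 64 bits each) for a set of integers."""
    return [min(_hash64(seed, x) for x in items) for seed in range(k)]


def lowest_bits(signature, b: int):
    """Keep only the lowest b bits of every minwise hash value."""
    mask = (1 << b) - 1
    return [value & mask for value in signature]


def estimate_resemblance(sig1, sig2, b: int) -> float:
    """Estimate resemblance R from two b-bit signatures (sparse-data assumption)."""
    k = len(sig1)
    p_hat = sum(u == v for u, v in zip(sig1, sig2)) / k  # empirical match rate
    c = 1.0 / (1 << b)  # chance that two different minima share their low b bits
    return (p_hat - c) / (1.0 - c)


def storage_improvement(R: float, b: int, full_bits: int = 64) -> float:
    """Storage ratio (full_bits vs. b bits) at equal variance, sparse-data assumption.

    Per-sample variances: R(1-R) for full minwise hashing and
    P(1-P)/(1-c)^2 for the b-bit estimator, with P = c + (1-c)R and c = 2^-b.
    """
    c = 1.0 / (1 << b)
    p = c + (1.0 - c) * R
    var_full = R * (1.0 - R)
    var_b = p * (1.0 - p) / (1.0 - c) ** 2
    return (full_bits * var_full) / (b * var_b)


if __name__ == "__main__":
    A = set(range(0, 300))
    B = set(range(100, 400))  # |A ∩ B| = 200, |A ∪ B| = 400, so R = 0.5
    k = 500
    sig_a, sig_b = min_hashes(A, k), min_hashes(B, k)
    for b in (1, 2):
        print(b, estimate_resemblance(lowest_bits(sig_a, b), lowest_bits(sig_b, b), b))
    print(storage_improvement(0.5, 1, 64), storage_improvement(0.5, 1, 32))
```

Under these assumptions, `storage_improvement(0.5, 1, 64)` evaluates to 64/3 ≈ 21.3 and `storage_improvement(0.5, 1, 32)` to 32/3 ≈ 10.7, consistent with the factors quoted above: at R = 0.5 the 1-bit estimator needs about three times as many samples as full minwise hashing, but each sample costs 1 bit instead of 64 (or 32).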
