Paper Information - Iterative Universal Hash Function Generator for Minhashing

Minhashing is a technique for estimating the Jaccard Index between two sets by exploiting the probability of collision under a random permutation. To speed up the computation, the random permutation can be approximated by a universal hash function such as the $h_{a,b}$ function proposed by Carter and Wegman. A better estimate of the Jaccard Index can be achieved by using many of these hash functions, created at random. This paper devises a new iterative procedure for generating a set of $h_{a,b}$ functions that eliminates the need for a list of random values and avoids the multiplication operation during the calculation. The generated hash functions retain the properties of a universal hash function family. This is possible because of the random nature of feature occurrence in sparse datasets. Results show that the uniformity of the feature hashing is maintained while obtaining a speedup of up to $1.38$ compared to the traditional approach.
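For orientation, the traditional approach that the abstract uses as its baseline can be sketched as follows. This is a minimal illustration, not the paper's iterative procedure, which the abstract does not specify. It assumes features are non-negative integer ids smaller than a prime $p$, takes $h_{a,b}(x) = (a x + b) \bmod p$ (dropping the final reduction to a table size as a simplification), and stores a list of random $(a, b)$ pairs. The prime `P` and all function names are hypothetical choices made for this example.

```python
import random

# Assumed prime modulus (2^31 - 1); feature ids must be smaller than P.
P = 2_147_483_647


def make_hash_params(k, seed=0):
    """Draw k random (a, b) pairs, with a != 0, for h_{a,b}(x) = (a*x + b) mod P."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(0, P)) for _ in range(k)]


def h(a, b, x):
    """Carter-Wegman style hash; one multiplication per evaluation."""
    return (a * x + b) % P


def minhash_signature(features, params):
    """For each hash function, keep the smallest hash value over the set."""
    return [min(h(a, b, x) for x in features) for a, b in params]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two sets collide."""
    matches = sum(1 for u, v in zip(sig_a, sig_b) if u == v)
    return matches / len(sig_a)


def exact_jaccard(set_a, set_b):
    """Exact Jaccard Index, for comparison with the estimate."""
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    params = make_hash_params(k=200)
    doc_a = {3, 17, 42, 99, 256, 1024}
    doc_b = {3, 17, 42, 500, 777, 1024}
    sig_a = minhash_signature(doc_a, params)
    sig_b = minhash_signature(doc_b, params)
    print("estimated:", estimate_jaccard(sig_a, sig_b))
    print("exact:    ", exact_jaccard(doc_a, doc_b))
```

In this sketch the list returned by `make_hash_params` and the multiplication inside `h` are the two costs the abstract says the proposed generator removes. Increasing `k` generally tightens the estimate, at the price of more hash evaluations per set.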

Fabrício Olivetti de França

[1] Masoud Nikravesh, et al. Feature Extraction - Foundations and Applications, 2006, Feature Extraction.

[2] Radoslaw Szmit. Locality Sensitive Hashing for Similarity Search Using MapReduce on Large Scale Data, 2013, IIS.

[3] Piotr Indyk, et al. Approximate Nearest Neighbor: Towards Removing the Curse of Dimensionality, 2012, Theory Comput.

[4] Larry Carter, et al. Universal Classes of Hash Functions, 1979, J. Comput. Syst. Sci.

[5] Fabrício Olivetti de França. Scalable Overlapping Co-clustering of Word-Document Data, 2012, 2012 11th International Conference on Machine Learning and Applications.

[6] Isabelle Guyon, et al. An Introduction to Variable and Feature Selection, 2003, J. Mach. Learn. Res.

[7] L. Ryd, et al. On bias, 1994, Acta orthopaedica Scandinavica.

[8] Andrei Z. Broder, et al. On the resemblance and containment of documents, 1997, Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171).

[9] Jerome H. Friedman. On Bias, Variance, 0/1-Loss, and the Curse-of-Dimensionality, 1997.

[10] Alexandr Andoni, et al. Beyond Locality-Sensitive Hashing, 2013, SODA.

[11] Fernando José Von Zuben, et al. Feature Subset Selection by Means of a Bayesian Artificial Immune System, 2008, 2008 Eighth International Conference on Hybrid Intelligent Systems.

[12] Pascal Vincent, et al. Unsupervised Feature Learning and Deep Learning: A Review and New Perspectives, 2012, ArXiv.

[13] Ken Lang, et al. NewsWeeder: Learning to Filter Netnews, 1995, ICML.

[14] João Miguel da Costa Sousa, et al. Metaheuristics for feature selection: Application to sepsis outcome prediction, 2012, 2012 IEEE Congress on Evolutionary Computation.