Linear hash functions

Consider the set <inline-equation> <f> <sc>H</sc></f></inline-equation> of all linear (or affine) transformations between two vector spaces over a finite field <italic>F</italic>. We study how good <inline-equation> <f> <sc>H</sc></f></inline-equation> is as a class of hash functions: we hash a set <italic>S</italic> of size <italic>n</italic> into a range of the same cardinality <italic>n</italic> by a randomly chosen function from <inline-equation> <f> <sc>H</sc></f></inline-equation> and look at the expected size of the largest hash bucket. <inline-equation> <f> <sc>H</sc></f></inline-equation> is a universal class of hash functions for any finite field, but with respect to this measure different fields behave differently. If the finite field <italic>F</italic> has <italic>n</italic> elements, then there is a bad set <italic>S</italic> <inline-equation> <f> ⊂</f></inline-equation> <italic>F</italic><supscrpt>2</supscrpt> of size <italic>n</italic> with expected maximal bucket size Ω(<italic>n</italic><supscrpt>1/3</supscrpt>). If <italic>n</italic> is a perfect square, then there is even a bad set with largest bucket size <italic>always</italic> at least <inline-equation> <f> <rad> <rcd>n</rcd></rad></f></inline-equation>. (This is the worst possible, since with respect to a universal class of hash functions every set of size <italic>n</italic> has expected largest bucket size below <inline-equation> <f> <rad> <rcd>n</rcd></rad></f></inline-equation> + 1/2.) If, however, we consider the field of two elements, then we get much better bounds. The best previously known upper bound on the expected size of the largest bucket for this class was <italic>O</italic>(2<supscrpt><inline-equation> <f> <rad> <rcd>log n</rcd></rad></f></inline-equation></supscrpt>). We reduce this upper bound to <italic>O</italic>(log <italic>n</italic> log log <italic>n</italic>). 
Note that this is not far from the guarantee for a random function, for which the expected size of the largest bucket would be Θ(log <italic>n</italic> / log log <italic>n</italic>). In the course of our proof we develop a tool which may be of independent interest. Suppose we have a subset <italic>S</italic> of a vector space <italic>D</italic> over <bold>Z</bold><subscrpt>2</subscrpt>, and consider a random linear mapping of <italic>D</italic> to a smaller vector space <italic>R</italic>. If the cardinality of <italic>S</italic> is larger than <italic>c</italic><subscrpt>ε</subscrpt>|<italic>R</italic>| log |<italic>R</italic>|, then with probability at least 1 - ε, the image of <italic>S</italic> covers all elements of the range.
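The two quantities discussed above can be explored empirically with a minimal sketch. It assumes vectors over Z_2 are stored as integer bitmasks and that a random linear map is a matrix with independent uniform bits. The helper names (`random_linear_map`, `apply_map`, `largest_bucket`, `covers_range`), the dimensions, and the factor 8 standing in for the unspecified constant c_ε are illustrative choices, not taken from the paper.

```python
import random
from collections import Counter


def random_linear_map(in_bits, out_bits, rng):
    """Random out_bits x in_bits matrix over Z_2, one integer bitmask per row."""
    return [rng.getrandbits(in_bits) for _ in range(out_bits)]


def apply_map(rows, x):
    """Compute h(x) = Mx over Z_2: output bit i is the parity of rows[i] AND x."""
    y = 0
    for i, row in enumerate(rows):
        parity = bin(row & x).count("1") & 1
        y |= parity << i
    return y


def largest_bucket(rows, keys):
    """Size of the fullest bucket when every key is hashed with the given map."""
    counts = Counter(apply_map(rows, x) for x in keys)
    return max(counts.values())


def covers_range(rows, keys):
    """True if the image of the keys is the whole range Z_2^len(rows)."""
    image = {apply_map(rows, x) for x in keys}
    return len(image) == 1 << len(rows)


def demo(trials=20, seed=0):
    rng = random.Random(seed)
    in_bits = 32

    # Experiment 1: hash n keys into a range of the same size n.
    out_bits = 10
    n = 1 << out_bits
    keys = rng.sample(range(1 << in_bits), n)
    sizes = [
        largest_bucket(random_linear_map(in_bits, out_bits, rng), keys)
        for _ in range(trials)
    ]
    print("mean largest bucket over", trials, "maps:", sum(sizes) / trials)

    # Experiment 2: covering. The factor 8 is an arbitrary stand-in for c_eps.
    out_bits = 6
    size = 8 * (1 << out_bits) * out_bits  # 8 * |R| * log2|R|
    keys = rng.sample(range(1 << in_bits), size)
    hits = sum(
        covers_range(random_linear_map(in_bits, out_bits, rng), keys)
        for _ in range(trials)
    )
    print("maps whose image covers the range:", hits, "of", trials)


if __name__ == "__main__":
    demo()
```

Random key sets such as the ones sampled here only illustrate typical behaviour; the bounds in the text concern worst-case sets S, which this sketch does not attempt to construct.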
