Codes for the World Wide Web

We introduce a new family of simple, complete instantaneous codes for positive integers, called ζ codes, which are suitable for integers distributed as a power law with small exponent (smaller than 2). The main motivation for the introduction of ζ codes comes from web-graph compression: if nodes are numbered according to URL lexicographical order, gaps in successor lists are distributed according to a power law with small exponent. We give estimates of the expected length of ζ codes against power-law distributions, and compare the results with analogous estimates for the more classical γ, δ and variable-length block codes.
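
To make the codes concrete, here is a minimal Python sketch of a ζ encoder. The abstract does not define the code, so the construction below is an assumption based on the usual definition of ζ with shrinking parameter k: for x in [2^(hk), 2^((h+1)k) - 1], write h + 1 in unary, then write x - 2^(hk) as a minimal binary code over an interval of size 2^((h+1)k) - 2^(hk). The function names (`unary`, `minimal_binary`, `zeta`) and the bit-string representation are illustrative choices, not taken from the paper.

```python
def _bits(value: int, width: int) -> str:
    """Return `value` as a binary string of exactly `width` bits ('' if width is 0)."""
    return format(value, "b").zfill(width) if width > 0 else ""


def unary(n: int) -> str:
    """Unary code of n >= 1, assumed here to be n - 1 zeros followed by a one."""
    return "0" * (n - 1) + "1"


def minimal_binary(x: int, z: int) -> str:
    """Minimal binary code of x in [0, z - 1]."""
    s = (z - 1).bit_length()       # ceil(log2(z)) for z >= 1
    short = (1 << s) - z           # number of codewords that are s - 1 bits long
    if x < short:
        return _bits(x, s - 1)
    return _bits(x - z + (1 << s), s)


def zeta(x: int, k: int) -> str:
    """Sketch of the zeta code with shrinking parameter k for a positive integer x."""
    if x < 1 or k < 1:
        raise ValueError("x and k must be positive")
    h = (x.bit_length() - 1) // k  # x lies in [2^(hk), 2^((h+1)k) - 1]
    low = 1 << (h * k)
    size = (1 << ((h + 1) * k)) - low
    return unary(h + 1) + minimal_binary(x - low, size)


if __name__ == "__main__":
    for x in (1, 2, 3, 4, 8):
        print(x, zeta(x, 2))
    # 1 10
    # 2 110
    # 3 111
    # 4 01000
    # 8 011000
```

Under this definition, k = 1 reproduces the γ code (for example, `zeta(2, 1)` gives `010`), while larger k spends more bits on small integers in exchange for shorter codewords on large ones, which is the trade-off that matters for power laws with small exponent.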
