Towards compressing Web graphs

We consider the problem of compressing graphs of the link structure of the World Wide Web. We provide efficient algorithms for such compression that are motivated by random graph models for describing the Web. The algorithms are based on reducing the compression problem to the problem of finding a minimum spanning tree in a directed graph related to the original link graph. The performance of the algorithms on graphs generated by the random graph models suggests that by taking advantage of the link structure of the Web, one may achieve significantly better compression than natural Huffman-based schemes. We also provide hardness results demonstrating limitations on natural extensions of our approach.
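The reduction to a directed minimum spanning tree can be illustrated with a minimal sketch. The abstract does not specify the cost model or the construction of the auxiliary graph, so everything below is an assumption made for illustration: each page's out-link set is encoded either directly (an edge from a hypothetical virtual root) or as a difference against one other page's set (an edge from that reference page), and costs are counted in fixed-width page identifiers. The helper names `direct_cost`, `reference_cost`, `build_affinity_edges`, and `compressed_bits` are hypothetical. The minimum-weight spanning arborescence is computed with the standard Chu-Liu/Edmonds contraction procedure, returning only the total cost.

```python
import math

def direct_cost(targets, id_bits):
    # Assumed cost model: list every out-link with a fixed-width page id.
    return len(targets) * id_bits

def reference_cost(ref_targets, targets, id_bits):
    # Assumed cost model: name the reference page, then list the
    # symmetric difference between the two out-link sets.
    return id_bits + len(ref_targets ^ targets) * id_bits

def build_affinity_edges(links):
    """links[v] is the set of pages that page v points to."""
    n = len(links)
    id_bits = max(1, math.ceil(math.log2(n)))
    root = n  # virtual root: "encode without a reference"
    edges = [(root, v, direct_cost(links[v], id_bits)) for v in range(n)]
    for u in range(n):
        for v in range(n):
            if u != v:
                edges.append((u, v, reference_cost(links[u], links[v], id_bits)))
    return n + 1, root, edges

def min_arborescence_cost(n, root, edges):
    """Chu-Liu/Edmonds: total weight of a minimum spanning arborescence
    rooted at `root`, or None if some node is unreachable."""
    inf = float("inf")
    total = 0
    while True:
        # Step 1: pick the cheapest incoming edge of every node.
        min_in, pre = [inf] * n, [-1] * n
        for u, v, w in edges:
            if u != v and w < min_in[v]:
                min_in[v], pre[v] = w, u
        if any(v != root and min_in[v] == inf for v in range(n)):
            return None
        min_in[root] = 0
        # Step 2: look for cycles among the chosen edges.
        comp, seen, count = [-1] * n, [-1] * n, 0
        for v in range(n):
            total += min_in[v]
            x = v
            while x != root and seen[x] != v and comp[x] == -1:
                seen[x] = v
                x = pre[x]
            if x != root and comp[x] == -1:  # walked back onto this path
                y = pre[x]
                while y != x:
                    comp[y] = count
                    y = pre[y]
                comp[x] = count
                count += 1
        if count == 0:
            return total  # chosen edges already form an arborescence
        for v in range(n):
            if comp[v] == -1:
                comp[v] = count
                count += 1
        # Step 3: contract each cycle to one node and reweight edges.
        edges = [(comp[u], comp[v], w - min_in[v])
                 for u, v, w in edges if comp[u] != comp[v]]
        root, n = comp[root], count

def direct_bits(links):
    # Baseline: encode every page directly, with no references.
    id_bits = max(1, math.ceil(math.log2(len(links))))
    return sum(direct_cost(t, id_bits) for t in links)

def compressed_bits(links):
    return min_arborescence_cost(*build_affinity_edges(links))
```

For the four-page toy graph `[{1, 2, 3}, {1, 2, 3}, {1, 2}, {0}]`, `direct_bits` gives 18 and `compressed_bits` gives 12 under this assumed cost model, because pages with similar out-link sets are encoded as small differences against one another. The toy graph and cost model only show the shape of the reduction; they say nothing about the compression ratios reported by the authors.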
