Compressing the graph structure of the Web

A large amount of research has recently focused on the graph structure (or link structure) of the World Wide Web. This structure has proven to be extremely useful for improving the performance of search engines and other tools for navigating the Web. However, since the graphs in these scenarios involve hundreds of millions of nodes and even more edges, highly space-efficient data structures are needed to fit the data in memory. A first step in this direction was done by the DEC connectivity server, which stores the graph in compressed form. We describe techniques for compressing the graph structure of the Web, and give experimental results of a prototype implementation. We attempt to exploit a variety of different sources of compressibility of these graphs and of the associated set of URLs in order to obtain good compression performance on a large Web graph.

[1]  Ian H. Witten,et al.  Managing gigabytes (2nd ed.): compressing and indexing documents and images , 1999 .

[2]  Ian H. Witten,et al.  Managing Gigabytes: Compressing and Indexing Documents and Images , 1999 .

[3]  Koichi Takeda,et al.  Information retrieval on the web , 2000, CSUR.

[4]  John H. Reif,et al.  Efficient lossless compression of trees and graphs , 1996, Proceedings of Data Compression Conference - DCC '96.

[5]  Soumen Chakrabarti,et al.  Distributed Hypertext Resource Discovery Through Examples , 1999, VLDB.

[6]  Jon M. Kleinberg,et al.  Mining the Web's Link Structure , 1999, Computer.

[7]  Piotr Indyk,et al.  Enhanced hypertext categorization using hyperlinks , 1998, SIGMOD '98.

[8]  Edward F. Grove,et al.  External-memory graph algorithms , 1995, SODA '95.

[9]  Narsingh Deo,et al.  A Structural Approach to Graph Compression , 1998 .

[10]  Jeffery R. Westbrook,et al.  Short Encodings of Planar Graphs and Maps , 1995, Discret. Appl. Math..

[11]  Andrei Z. Broder,et al.  The Connectivity Server: Fast Access to Linkage Information on the Web , 1998, Comput. Networks.

[12]  Ravi Kumar,et al.  Extracting Large-Scale Knowledge Bases from the Web , 1999, VLDB.

[13]  Andrei Z. Broder,et al.  Graph structure in the Web , 2000, Comput. Networks.

[14]  Martin van den Berg,et al.  Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery , 1999, Comput. Networks.

[15]  Sergey Brin,et al.  The Anatomy of a Large-Scale Hypertextual Web Search Engine , 1998, Comput. Networks.

[16]  VitterJeffrey Scott External memory algorithms and data structures , 2001 .

[17]  Krishna Bharat,et al.  Improved algorithms for topic distillation in a hyperlinked environment , 1998, SIGIR '98.

[18]  Andrei Z. Broder,et al.  Informatin Retrieval on the Web , 1998, Proceedings 39th Annual Symposium on Foundations of Computer Science (Cat. No.98CB36280).

[19]  Jeffrey Scott Vitter,et al.  External memory algorithms and data structures: dealing with massive data , 2001, CSUR.

[20]  Ravi Kumar,et al.  Trawling the Web for Emerging Cyber-Communities , 1999, Comput. Networks.

[21]  Micah Adler,et al.  Towards compressing Web graphs , 2001, Proceedings DCC 2001. Data Compression Conference.