The Web as a graph: How far we are

In this article we present an experimental study of the properties of webgraphs. We study a large crawl from 2001 of 200 million pages and about 1.4 billion edges, made available by the WebBase project at Stanford, as well as several synthetic graphs generated according to various recently proposed models. We investigate several topological properties of these graphs, including the number of bipartite cores and strongly connected components, the distributions of degrees and PageRank values, and some correlations among them; we then compare the models against these measures.

Our findings are that (i) the WebBase sample differs slightly from the (older) samples studied in the literature, and (ii) although these models do not capture all of its properties, they do exhibit some peculiar behaviors not found, for example, in the models from classical random graph theory.

Moreover, we developed a software library able to generate and measure massive graphs in secondary memory; this library is publicly available under the GPL licence. We discuss its implementation and some computational issues related to secondary-memory graph algorithms.
