HADI: Fast Diameter Estimation and Mining in Massive Graphs with Hadoop

How can we quickly find the diameter of a petabyte-sized graph? Large graphs are ubiquitous: social networks (Facebook, LinkedIn, etc.), the World Wide Web, biological networks, computer networks, and many more. The size of the graphs of interest has been increasing rapidly in recent years, and with it the need for algorithms that can handle tera- and petabyte graphs. A promising direction for coping with such sizes is the emerging map/reduce architecture and its open-source implementation, Hadoop. Estimating the diameter of a graph, as well as the radius of each node, is a valuable operation that can help us spot outliers and anomalies. We propose HADI (HAdoop based DIameter estimator), a carefully designed algorithm to compute the diameters of petabyte-scale graphs. We run the algorithm to analyze the largest public web graph ever analyzed, with billions of nodes and edges. Additional contributions include the following: (a) we propose several performance optimizations; (b) we achieve excellent scale-up; and (c) we report interesting observations, including outliers and related patterns, on this real graph (116Gb), as well as on several other real, smaller graphs. One of the observations is that the Albert et al. conjecture about the diameter of the web is overly pessimistic.
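
To make the two quantities named above concrete, the following minimal sketch computes them exactly on a toy graph with one breadth-first search per node. It is not HADI and does not use map/reduce; it assumes an undirected, connected graph and takes the radius of a node to be its distance to the farthest node, with the diameter as the largest radius. The names `bfs_distances`, `radii_and_diameter`, and `adj` are hypothetical, chosen for illustration only.

```python
from collections import deque


def bfs_distances(adj, source):
    """Return hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def radii_and_diameter(adj):
    """Exact radius of each node and the graph diameter (assumed definitions).

    Runs one BFS per node, so the cost grows with nodes times edges.
    """
    radii = {v: max(bfs_distances(adj, v).values()) for v in adj}
    return radii, max(radii.values())


# Toy graph (illustrative only): path 0-1-2-3 with an extra leaf 4 attached to node 1.
adj = {0: [1], 1: [0, 2, 4], 2: [1, 3], 3: [2], 4: [1]}
radii, diameter = radii_and_diameter(adj)
print(radii)     # {0: 3, 1: 2, 2: 2, 3: 3, 4: 3}
print(diameter)  # 3
```

Running a full search from every node is only practical for small graphs, which is why an estimator is needed at the scale the abstract describes.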