PEGASUS: mining peta-scale graphs

In this paper, we describe PeGaSus, an open source Peta Graph Mining library that performs typical graph mining tasks such as computing the diameter of a graph, computing the radius of each node, finding the connected components, and computing the importance score of nodes. As graphs grow to several giga-, tera-, or petabytes, the need for such a library grows too. To the best of our knowledge, PeGaSus is the first such library, implemented on top of the Hadoop platform, the open source version of MapReduce. Many graph mining operations (PageRank, spectral clustering, diameter estimation, connected components, etc.) are essentially repeated matrix-vector multiplications. We describe an important primitive for PeGaSus, called GIM-V (generalized iterated matrix-vector multiplication). GIM-V is highly optimized, achieving (a) good scale-up with the number of available machines, (b) running time linear in the number of edges, and (c) more than 5 times faster performance than the non-optimized version of GIM-V. Our experiments ran on M45, one of the top 50 supercomputers in the world. We report our findings on several real graphs, including one of the largest publicly available Web graphs, thanks to Yahoo!, with ≈ 6.7 billion edges.
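
To make the idea of a generalized iterated matrix-vector multiplication concrete, the following is a minimal single-machine sketch, not the Hadoop implementation described in the paper. It assumes the generalization is expressed through three user-supplied operations (here named combine2, combine_all, and assign); the function names, the edge-list representation, and the stopping rule are illustrative assumptions. Connected components are used as the example: each node repeatedly takes the minimum component id among itself and its neighbors.

```python
from collections import defaultdict


def gim_v(edges, v, combine2, combine_all, assign, max_iters=100):
    """Iterate a generalized matrix-vector product until v stops changing.

    edges: list of (i, j, m_ij) triples, the nonzero entries of the matrix.
    v:     dict mapping node id -> current vector value.
    """
    for _ in range(max_iters):
        # Step 1: combine each matrix entry m_ij with the vector entry v_j.
        partial = defaultdict(list)
        for i, j, m_ij in edges:
            partial[i].append(combine2(m_ij, v[j]))
        # Step 2: reduce the partial results per row, then
        # Step 3: decide how the reduced value updates the old v_i.
        new_v = {i: assign(v[i], combine_all(i, partial.get(i, []))) for i in v}
        if new_v == v:  # fixpoint reached
            break
        v = new_v
    return v


def connected_components(nodes, undirected_edges):
    """Label each node with the smallest node id in its component."""
    # Store each undirected edge in both directions (symmetric matrix).
    edges = [(a, b, 1) for a, b in undirected_edges]
    edges += [(b, a, 1) for a, b in undirected_edges]
    initial = {n: n for n in nodes}  # every node starts in its own component
    return gim_v(
        edges,
        initial,
        combine2=lambda m_ij, v_j: v_j,                      # pass neighbor id
        combine_all=lambda i, xs: min(xs) if xs else None,   # smallest neighbor id
        assign=lambda old, new: old if new is None else min(old, new),
    )


if __name__ == "__main__":
    print(connected_components([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)]))
```

On the small graph in the usage line, nodes 1, 2, and 3 end up labeled 1 and nodes 4 and 5 end up labeled 4. Swapping in multiplication for combine2, summation for combine_all, and "take the new value" for assign would recover an ordinary matrix-vector product, which is the sense in which the primitive generalizes it.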