Iterative Computation of Connected Graph Components with MapReduce

Using the MapReduce framework for iterative graph algorithms is challenging: high performance requires limiting both the amount of intermediate results and the number of iterations. We address these issues for the important problem of finding connected components in large graphs. We analyze an existing MapReduce algorithm, CC-MR, and present techniques to improve its performance, including a memory-based connection of subgraphs in the map phase. Our evaluation on several large graph datasets shows that the improvements reduce the amount of generated data by up to a factor of 8.8 and the runtime by up to a factor of 3.5.
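To illustrate why intermediate data and iteration count matter, the following minimal sketch simulates a generic iterative connected-components computation in a map/reduce style, in a single process. It is not CC-MR and not the improved method described above; it is a simple minimum-label propagation baseline. All names (`map_phase`, `reduce_phase`, `connected_components`) are hypothetical and chosen for illustration only.

```python
from collections import defaultdict


def map_phase(labels, edges):
    """Emit (vertex, candidate_label) pairs.

    Each vertex keeps its own label and receives the label of every neighbour.
    The number of emitted pairs stands in for the intermediate data volume.
    """
    for vertex, label in labels.items():
        yield vertex, label
    for u, v in edges:
        yield u, labels[v]
        yield v, labels[u]


def reduce_phase(pairs):
    """Group candidates by vertex and keep the smallest label per vertex."""
    grouped = defaultdict(list)
    for vertex, label in pairs:
        grouped[vertex].append(label)
    return {vertex: min(candidates) for vertex, candidates in grouped.items()}


def connected_components(vertices, edges):
    """Repeat map/reduce rounds until no label changes (assumed baseline)."""
    labels = {v: v for v in vertices}  # every vertex starts as its own component
    iterations = 0
    while True:
        new_labels = reduce_phase(map_phase(labels, edges))
        iterations += 1
        if new_labels == labels:
            return labels, iterations
        labels = new_labels


if __name__ == "__main__":
    labels, rounds = connected_components(range(1, 7), [(1, 2), (2, 3), (4, 5)])
    print(labels, rounds)
```

In this baseline, every round re-emits a record per vertex and two per edge, and the number of rounds grows with the longest path in a component. Those are the two costs the abstract identifies as critical to limit.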
