An evaluation study of BigData frameworks for graph processing

When Google first introduced the Map/Reduce paradigm in 2004, no comparable system was available to the general public. The situation has changed since then. The Map/Reduce paradigm has become increasingly popular, and there is no shortage of Map/Reduce implementations in today's computing world. The predominant solution is currently Apache Hadoop, started by Yahoo. Besides deploying custom Map/Reduce installations, customers of cloud services can now use ready-made installations (e.g. the Elastic Map/Reduce System). In the meantime, other, second-generation frameworks have started to appear. They either fine-tune the Map/Reduce model for specific scenarios or change the paradigm altogether, as Google's Pregel does. In this paper, we compare these second-generation frameworks with the current de facto standard, Hadoop, focusing on a specific scenario: large-scale graph analysis. We analyze the different means of fine-tuning those systems by exploiting their unique features. We base our analysis on the k-core decomposition problem, whose goal is to compute the centrality of each node in a given graph. We tested our implementation on a cluster of Amazon EC2 nodes with realistic datasets made publicly available by the SNAP project.
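To make the k-core decomposition problem concrete, the following is a minimal single-machine sketch in Python. It is not the distributed implementation evaluated in the paper; it uses the textbook peeling approach (repeatedly remove a node of minimum remaining degree) and assumes an undirected graph given as a list of edge pairs. The function name `core_numbers` and the input format are illustrative choices, not taken from the paper.

```python
from collections import defaultdict


def core_numbers(edges):
    """Return {node: coreness} for an undirected graph given as (u, v) pairs.

    The coreness of a node is the largest k such that the node belongs to
    a subgraph in which every node has at least k neighbours.
    """
    # Build an adjacency map; self-loops are ignored (an assumption of this sketch).
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)

    degree = {node: len(nbrs) for node, nbrs in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0

    # Peel nodes in order of minimum remaining degree. This linear scan
    # keeps the sketch short; it is quadratic in the number of nodes.
    while remaining:
        node = min(remaining, key=degree.__getitem__)
        k = max(k, degree[node])  # coreness never decreases along the peeling order
        core[node] = k
        remaining.remove(node)
        for nbr in adj[node]:
            if nbr in remaining:
                degree[nbr] -= 1

    return core


if __name__ == "__main__":
    # A triangle (a, b, c) with one pendant node d attached to c.
    print(core_numbers([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]))
```

In this toy graph the triangle nodes end up in the 2-core, while the pendant node only reaches the 1-core. A distributed version on Hadoop or a Pregel-style system would have to express the same computation as iterative rounds of message exchange rather than a global minimum-degree scan.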
