Benchmarking of Distributed Computing Engines Spark and GraphLab for Big Data Analytics

In this paper we evaluate and compare two representative and popular distributed processing engines for large-scale big data analytics: the generic engine Spark and the graph-based engine GraphLab. We design a benchmark suite of representative algorithms and datasets to compare the engines in terms of running time, memory and CPU usage, and network and I/O overhead. The benchmark suite is tested on both a local computer cluster and virtual machines in the cloud. By varying the number of computers and the amount of memory, we examine the scalability of the computing engines with increasing computing resources (such as CPU and memory). We also cross-evaluate generic and graph-based analytic algorithms on both the graph processing and the generic platforms to identify the potential performance degradation if only one processing engine is available. We observe that both computing engines scale well as computing resources increase. While GraphLab largely outperforms Spark for graph algorithms, its running time is close to that of Spark for non-graph algorithms. Additionally, the running time of Spark for graph algorithms on cloud virtual machines is observed to increase by almost 100% compared to local computer clusters.
