Evaluating the Scaling of Graph-Algorithms for Big Data Using GraphX

Graph processing has attracted considerable attention in a range of big data scenarios. In this paper, we present the design, implementation, and experimental evaluation of graph processing algorithms in two application areas. First, we use semi-clustering as an example of an algorithm typically used in social network analysis. Then, we examine a collaborative filtering algorithm of the kind typically used in e-commerce scenarios. For both algorithms, we use Apache GraphX, an existing distributed graph processing framework built on Apache Spark. Because GraphX does not include these two algorithms, we describe how to implement them using a combination of GraphX and the underlying Spark Core. Based on our implementation, we perform experiments to test the scalability of both the algorithms and the GraphX processing framework. The experiments show that different kinds of graph algorithms can be supported within the Spark framework. Furthermore, we show that, for our test data, the algorithms scale almost linearly when properly designed.
