Large-scale graph processing is widely used in computational domains ranging from language modeling to social networks. Graph-parallel systems have been proposed to process such graphs on clusters of up to hundreds of nodes. However, a large graph often exceeds the main memory available in a small cluster, and as a consequence tasks fail frequently. To address this problem, we propose SGraph, a distributed streaming graph processing system built on top of Spark. SGraph introduces a streaming data model that avoids loading the entire graph at once, since the full graph may not fit in the available RAM. In addition, SGraph adopts an edge-centric scatter-gather computing model in which graph algorithms can be implemented conveniently. Experiments demonstrate that SGraph can process graphs with up to 1.5 billion edges on small clusters of several low-cost commodity PCs, whereas existing systems may require tens or even hundreds of high-end machines. Furthermore, SGraph is up to 2.3 times faster than existing systems.
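To make the edge-centric scatter-gather idea concrete, the following is a minimal single-machine sketch in Python. It is not SGraph's API or implementation: the names `stream_edges`, `connected_components`, `edge_source`, and `chunk_size` are hypothetical, and connected components by minimum-label propagation is chosen only as an illustrative algorithm. The sketch assumes that vertex state fits in memory while edges are re-read in fixed-size chunks on every pass, so that only one chunk of edges is resident at a time.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Edge = Tuple[int, int]


def stream_edges(edges: Iterable[Edge], chunk_size: int) -> Iterator[List[Edge]]:
    """Yield edges in fixed-size chunks so only one chunk is held in memory."""
    chunk: List[Edge] = []
    for edge in edges:
        chunk.append(edge)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def connected_components(
    edge_source: Callable[[], Iterable[Edge]],
    vertices: Iterable[int],
    chunk_size: int = 2,
) -> Dict[int, int]:
    """Label each vertex with the smallest vertex id in its component.

    edge_source is a hypothetical callable that returns a fresh edge
    iterator on each call, standing in for re-reading edges from storage.
    """
    # Vertex state: every vertex starts with its own id as its label.
    labels = {v: v for v in vertices}
    changed = True
    while changed:
        changed = False
        for chunk in stream_edges(edge_source(), chunk_size):
            # Scatter: each edge emits an update in both directions,
            # treating the graph as undirected.
            updates: List[Tuple[int, int]] = []
            for src, dst in chunk:
                updates.append((dst, labels[src]))
                updates.append((src, labels[dst]))
            # Gather: each target vertex keeps the smallest label it receives.
            for vertex, label in updates:
                if label < labels[vertex]:
                    labels[vertex] = label
                    changed = True
    return labels
```

Calling `connected_components(lambda: iter(edge_list), vertex_ids)` repeats passes over the edge stream until no label changes. A distributed system would additionally partition edges and vertex state across machines, which this sketch leaves out.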