BiShard Parallel Processor: A Disk-Based Processing Engine for Billion-Scale Graphs

Abstract Processing very large graphs efficiently is a challenging task. Distributed graph processing systems process the billion-scale graphs efficiently but incur overheads of partitioning and distribution of the large graph over a cluster of nodes. In order to overcome these problems a disk-based engine, GraphChi was proposed recently that processes the graph in chunks on a single PC. GraphChi significantly outperformed all the representative distributed processing frameworks. Still, we observe that GraphChi incurs some serious degradation in performance due to 1) high number of non-sequential I/Os for processing every chunk of the graph; and 2) limited parallelism to process the graph. In this paper, we propose a novel engine named BiShard Parallel Processor (BSPP) to efficiently process billions-scale graphs on a single PC. We introduce a new storage structure BiShard. BiShard divides the large graph into subgraphs and maintains the in and out edges separately. This storage mechanism significantly reduces the number of non-sequential I/Os. We implement a new processing model named BiShard Parallel (BSP) on top of Bishard. BSP exploits the properties of Bishard to enable full CPU parallelism for processing the graph. Our experiments on real large graphs show that our solution significantly outperforms GraphChi.

[1]  Jure Leskovec,et al.  Community Structure in Large Networks: Natural Cluster Sizes and the Absence of Large Well-Defined Clusters , 2008, Internet Math..

[2]  Michael T. Goodrich,et al.  A bridging model for parallel computation, communication, and I/O , 1996, CSUR.

[3]  Jonathan Cohen,et al.  Graph Twiddling in a MapReduce World , 2009, Computing in Science & Engineering.

[4]  Leslie G. Valiant,et al.  A bridging model for parallel computation , 1990, CACM.

[5]  Torsten Suel,et al.  I/O-efficient techniques for computing pagerank , 2002, CIKM '02.

[6]  Josep-Lluís Larriba-Pey,et al.  Dex: high-performance exploration on large graphs for information retrieval , 2007, CIKM '07.

[7]  Jack Dongarra,et al.  MPI: The Complete Reference , 1996 .

[8]  Hosung Park,et al.  What is Twitter, a social network or a news media? , 2010, WWW '10.

[9]  D. S. Sivia,et al.  Data Analysis , 1996, Encyclopedia of Evolutionary Psychological Science.

[10]  Jinyang Li,et al.  Piccolo: Building Fast, Distributed Programs with Partitioned Tables , 2010, OSDI.

[11]  Joseph E. Gonzalez,et al.  GraphLab: A New Parallel Framework for Machine Learning , 2010 .

[12]  Jim Law,et al.  Review of "The boost graph library: user guide and reference manual by Jeremy G. Siek, Lie-Quan Lee, and Andrew Lumsdaine." Addison-Wesley 2002. , 2003, SOEN.

[13]  John N. Tsitsiklis,et al.  Parallel and distributed computation , 1989 .

[14]  Michael A. Bender,et al.  Optimal Sparse Matrix Dense Vector Multiplication in the I/O-Model , 2007, SPAA '07.

[15]  Andrew V. Goldberg,et al.  Computing the shortest path: A search meets graph theory , 2005, SODA '05.

[16]  Nancy M. Amato,et al.  Multithreaded Asynchronous Graph Traversal for In-Memory and Semi-External Memory , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.

[17]  L. Takac DATA ANALYSIS IN PUBLIC SOCIAL NETWORKS , 2012 .

[18]  Guy E. Blelloch,et al.  GraphChi: Large-Scale Graph Computation on Just a PC , 2012, OSDI.

[19]  Komal Kumar Bhatia,et al.  Page Ranking Algorithms: A Survey , 2009, 2009 IEEE International Advance Computing Conference.

[20]  Christos Faloutsos,et al.  PEGASUS: A Peta-Scale Graph Mining System Implementation and Observations , 2009, 2009 Ninth IEEE International Conference on Data Mining.

[21]  Scott Shenker,et al.  Spark: Cluster Computing with Working Sets , 2010, HotCloud.

[22]  J. Baruch,et al.  Group formation , 1950 .

[23]  Enhong Chen,et al.  Kineograph: taking the pulse of a fast-changing and connected world , 2012, EuroSys '12.

[24]  Bingsheng He,et al.  Large graph processing in the cloud , 2010, SIGMOD Conference.

[25]  Daniel S. Katz,et al.  Pegasus: A framework for mapping complex scientific workflows onto distributed systems , 2005, Sci. Program..

[26]  Taher H. Haveliwala Efficient Computation of PageRank , 1999 .

[27]  Aart J. C. Bik,et al.  Pregel: a system for large-scale graph processing , 2010, SIGMOD Conference.

[28]  Joseph Gonzalez,et al.  PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs , 2012, OSDI.

[29]  Jinha Kim,et al.  TurboGraph: a fast parallel graph engine handling billion-scale graphs in a single PC , 2013, KDD.

[30]  Kurt Mehlhorn,et al.  The LEDA Platform of Combinatorial and Geometric Computing , 1997, ICALP.

[31]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[32]  Satu Elisa Schaeffer,et al.  Graph Clustering , 2017, Encyclopedia of Machine Learning and Data Mining.

[33]  Alok Aggarwal,et al.  The input/output complexity of sorting and related problems , 1988, CACM.

[34]  Jon M. Kleinberg,et al.  Group formation in large social networks: membership, growth, and evolution , 2006, KDD '06.