Signal/Collect: Processing Large Graphs in Seconds

Both researchers and industry are confronted with the need to process increasingly large amounts of data, much of which has a natural graph representation. Some use MapReduce for scalable processing, but this abstraction is not designed for graphs and has shortcomings when it comes to both iterative and asynchronous processing, which are particularly important for graph algorithms. This paper presents the Signal/Collect programming model for scalable synchronous and asynchronous graph processing. We show that this abstraction can capture the essence of many algorithms on graphs in a concise and elegant way by giving Signal/Collect adaptations of algorithms that solve tasks as varied as clustering, inferencing, ranking, classification, constraint optimisation, and even query processing. Furthermore, we built and evaluated a parallel and distributed framework that executes algorithms in our programming model. We empirically show that our framework efficiently and scalably parallelises and distributes algorithms that are expressed in the programming model. We also show that asynchronicity can speed up execution times. Our framework can compute a PageRank on a large (>1.4 billion vertices, >6.6 billion edges) real-world graph in 112 seconds on eight machines, which is competitive with other graph processing approaches.
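To make the vertex-centric idea concrete, the following is a minimal single-machine sketch of a synchronous PageRank written in a signal/collect style: in each step every edge sends a signal derived from its source vertex's state, and every vertex then collects the signals it received into a new state. It is an illustration only, not the framework's API or the authors' code. The function name `signal_collect_pagerank`, its parameters, the damping factor of 0.85, and the fixed step count are all assumptions made for this example, and the sketch covers neither asynchronous scheduling nor distribution.

```python
from collections import defaultdict


def signal_collect_pagerank(edges, damping=0.85, steps=50):
    """Synchronous PageRank as alternating signal and collect phases.

    `edges` is an iterable of (source, target) pairs. All names and
    defaults here are hypothetical choices for illustration.
    """
    edges = list(edges)
    vertices = {v for edge in edges for v in edge}
    out_edges = defaultdict(list)
    for source, target in edges:
        out_edges[source].append(target)

    # Every vertex starts with the baseline rank as its state.
    state = {v: 1.0 - damping for v in vertices}

    for _ in range(steps):
        # Signal phase: each edge sends an equal share of its source's
        # current state to its target.
        inbox = defaultdict(list)
        for source, targets in out_edges.items():
            share = state[source] / len(targets)
            for target in targets:
                inbox[target].append(share)

        # Collect phase: each vertex derives its new state from the
        # signals it received in this step.
        state = {
            v: (1.0 - damping) + damping * sum(inbox[v])
            for v in vertices
        }

    return state


if __name__ == "__main__":
    ranks = signal_collect_pagerank([("a", "b"), ("b", "c"), ("c", "a")])
    for vertex in sorted(ranks):
        print(vertex, round(ranks[vertex], 4))
```

Separating the two phases keeps the per-vertex logic local: a vertex only reads its own state and the signals addressed to it, which is the property that makes this style of algorithm amenable to parallel and distributed execution.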
