Active messages as a spanning model for parallel graph computation

Graph applications belong to an increasingly important class of applications that lack the natural, domain-induced locality of traditional computational science problems based on large systems of PDEs. Rather than being analytically deducible, the dependency structure of a graph application is determined by the input graph itself. This data-carried dependency structure is expressed at run time and offers limited opportunities for static analysis. As a result, graph applications present challenges with regard to load balancing, resource utilization, and concurrency at HPC scales. This thesis presents a set of parallel programming abstractions and a software-design methodology that allow flexible, scalable, and highly concurrent graph algorithms to be implemented. Using active messages as the underlying communication mechanism provides three key performance benefits over more coarse-grained approaches. First, phrasing algorithms in terms of active messages reduces global synchronization and exposes asynchrony that can be used to hide communication latency. Second, executing and retiring messages as they are received reduces memory utilization. Finally, each active message represents an independent quantum of work that can be executed in parallel. Ensuring the atomicity of the underlying vertex and edge properties manipulated by messages allows fine-grained parallelism to be employed at the message level. The implementation of these ideas is presented in the context of the Parallel Boost Graph Library 2.0. This library is distinguished from other parallel graph implementations by two key features. First, moving computation to data, rather than vice versa, reduces the effects of communication latency. Second, runtime optimization separates algorithm specifications from the underlying implementation. This separation allows optimization to be performed as the structure of the input graph, and thus of the computation, is discovered.
Separating specification from implementation also provides performance portability and enables retroactive optimization. The generic design of the library provides a common framework in which to experiment with dynamic graphs, dynamic runtimes, new algorithms, and new hardware resources. Most importantly, this thesis demonstrates that phrasing graph algorithms as collections of asynchronous, concurrent, message-driven fragments of code allows for natural expression of algorithms, flexible implementations that leverage various forms of parallelism, and performance portability, all without modifying the algorithm expressions themselves.
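As a rough illustration of the message-driven phrasing described above, the following sketch expresses single-source shortest paths as a collection of small handlers that are executed and retired as messages arrive. It is a single-process simulation written for this summary, not code from the Parallel Boost Graph Library: the names `sssp_active_messages`, `send`, and `relax_handler` are hypothetical, the `pending` queue merely stands in for a network, and non-negative edge weights are assumed.

```python
from collections import deque


def sssp_active_messages(graph, source):
    """Single-source shortest paths phrased as message handlers.

    graph maps each vertex to a list of (neighbor, weight) pairs.
    Weights are assumed to be non-negative (illustrative assumption).
    """
    dist = {v: float("inf") for v in graph}
    pending = deque()  # stands in for the network; hypothetical

    def send(vertex, candidate):
        # "Sending" a message just enqueues it in this simulation.
        pending.append((vertex, candidate))

    def relax_handler(vertex, candidate):
        # Each message is an independent quantum of work. In a truly
        # parallel runtime this compare-and-update would have to be atomic.
        if candidate < dist[vertex]:
            dist[vertex] = candidate
            for neighbor, weight in graph[vertex]:
                send(neighbor, candidate + weight)

    send(source, 0)
    # Execute and retire messages as they arrive; an empty queue plays
    # the role of termination (quiescence) detection here.
    while pending:
        relax_handler(*pending.popleft())
    return dist


example_graph = {"a": [("b", 1), ("c", 4)], "b": [("c", 2)], "c": []}
example_dist = sssp_active_messages(example_graph, "a")
```

The algorithm is stated only in terms of what a handler does when a message arrives; how messages are queued, batched, or scheduled across threads is left to the runtime, which is the separation of specification from implementation that the abstract emphasizes.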
