Avalanche: a fine-grained flow graph model for irregular applications on distributed-memory systems

Flow graph models have become increasingly popular as a way to express parallel computations. However, most of these models either require specialized languages and compilers or are library-based solutions that need coarse-grained applications to achieve acceptable performance. Yet graph algorithms and other irregular applications are increasingly important in modern high-performance computing, and these applications cannot be coarsened without complicating their algorithmic structure. One effective existing approach for these applications relies on active messages. However, the separation of control flow between the main program and active message handlers introduces programming difficulties. To ameliorate this problem, we present Avalanche, a flow graph model for fine-grained applications that automatically generates active message handlers. Avalanche is built as a C++ library on top of our previously developed Active Pebbles model; a set of combinators builds graphs at compile time, allowing several optimizations to be applied by the library and a standard C++ compiler. In particular, consecutive flow graph nodes can be fused; experimental results show that flow graphs built from small components can still operate efficiently on fine-grained data.
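To make the idea of compile-time combinators and node fusion concrete, here is a minimal sketch. It is not Avalanche's or Active Pebbles' API: every name in it (`sketch::Map`, `sketch::Filter`, `sketch::Pipe`, `map`, `filter`, `seq`, `run_pipeline`) is hypothetical and chosen only for illustration. The sketch assumes that each node is a small function object and that composing two nodes yields a new type encoding the graph, so a standard C++ compiler can inline consecutive nodes into one per-message function. That function stands in for a generated active message handler; no real messaging layer is involved.

```cpp
#include <cassert>
#include <utility>
#include <vector>

namespace sketch {  // hypothetical names throughout; not the Avalanche API

// Node that transforms each item and passes the result downstream.
template <typename F>
struct Map {
    F f;
    template <typename T, typename Next>
    void operator()(T&& x, Next&& next) const {
        next(f(std::forward<T>(x)));
    }
};

// Node that forwards an item downstream only if the predicate holds.
template <typename P>
struct Filter {
    P p;
    template <typename T, typename Next>
    void operator()(T&& x, Next&& next) const {
        if (p(x)) next(std::forward<T>(x));
    }
};

// Composition of two nodes. The graph structure lives in the type, so the
// compiler sees the whole chain and can inline it into a single function.
template <typename A, typename B>
struct Pipe {
    A a;
    B b;
    template <typename T, typename Next>
    void operator()(T&& x, Next&& next) const {
        a(std::forward<T>(x),
          [&](auto&& y) { b(std::forward<decltype(y)>(y), next); });
    }
};

// Combinators that build nodes and connect them at compile time.
template <typename F> Map<F> map(F f) { return {std::move(f)}; }
template <typename P> Filter<P> filter(P p) { return {std::move(p)}; }
template <typename A, typename B>
Pipe<A, B> seq(A a, B b) { return {std::move(a), std::move(b)}; }

// Treat the fused graph as one "handler" invoked once per incoming message.
inline std::vector<int> run_pipeline(const std::vector<int>& messages) {
    auto graph = seq(filter([](int v) { return v % 2 == 0; }),
                     seq(map([](int v) { return v * v; }),
                         map([](int v) { return v + 1; })));
    std::vector<int> out;
    auto sink = [&out](int v) { out.push_back(v); };
    for (int m : messages) graph(m, sink);
    return out;
}

}  // namespace sketch
```

Calling `sketch::run_pipeline({1, 2, 3, 4})` keeps the even values, squares them, and adds one, yielding 5 and 17. The design point is that each small component carries no virtual dispatch or heap allocation, so fine-grained items can flow through several nodes at roughly the cost of one hand-written function.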
