Stateful dataflow multigraphs: a data-centric model for performance portability on heterogeneous architectures

The ubiquity of accelerators in high-performance computing has driven programming complexity beyond the skill set of the average domain scientist. To maintain performance portability in the future, it is imperative to decouple architecture-specific programming paradigms from the underlying scientific computations. We present the Stateful DataFlow multiGraph (SDFG), a data-centric intermediate representation that enables separating program definition from its optimization. By combining fine-grained data dependencies with high-level control flow, SDFGs are both expressive and amenable to program transformations, such as tiling and double-buffering. These transformations are applied to the SDFG in an interactive process, using extensible pattern matching, graph rewriting, and a graphical user interface. We demonstrate SDFGs on CPUs, GPUs, and FPGAs over various motifs, from fundamental computational kernels to graph analytics. We show that SDFGs deliver competitive performance, allowing domain scientists to develop applications naturally and port them to approach peak hardware performance without modifying the original scientific code.
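
To make the idea of separating program definition from its optimization concrete, here is a minimal sketch in plain Python. It is not the SDFG representation or any API from the paper; the names `matmul_kernel`, `naive_schedule`, `tiled_schedule`, and `run` are hypothetical and chosen only for illustration. The computation is written once, and tiling is applied purely by changing the order in which the iteration space is visited.

```python
from itertools import product


def matmul_kernel(A, B, C, i, j, k):
    # The "scientific code": a single fine-grained update.
    # It is never modified by the scheduling choices below.
    C[i][j] += A[i][k] * B[k][j]


def naive_schedule(n, m, p):
    # Visit the full (i, j, k) iteration space in lexicographic order.
    return product(range(n), range(m), range(p))


def tiled_schedule(n, m, p, tile=2):
    # Visit the same iteration space tile by tile (illustrative tiling).
    for i0, j0, k0 in product(range(0, n, tile),
                              range(0, m, tile),
                              range(0, p, tile)):
        yield from product(range(i0, min(i0 + tile, n)),
                           range(j0, min(j0 + tile, m)),
                           range(k0, min(k0 + tile, p)))


def run(A, B, schedule):
    # A is n x p, B is p x m; the schedule decides the visiting order only.
    n, p, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i, j, k in schedule(n, m, p):
        matmul_kernel(A, B, C, i, j, k)
    return C


A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
C_naive = run(A, B, naive_schedule)
C_tiled = run(A, B, tiled_schedule)
```

Both schedules produce the same result because they cover the same set of index triples; only the traversal order differs. In this toy setting the kernel plays the role of the unmodified scientific code, while the schedule stands in for a transformation that a performance engineer could swap independently.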
