HPVM: Heterogeneous Parallel Virtual Machine

We propose a parallel program representation for heterogeneous systems, designed to enable performance portability across a wide range of popular parallel hardware, including GPUs, vector instruction sets, multicore CPUs, and potentially FPGAs. Our representation, which we call HPVM, is a hierarchical dataflow graph with shared memory and vector instructions. HPVM supports three important capabilities for programming heterogeneous systems: it serves as a compiler intermediate representation (IR), as a virtual instruction set (ISA), and as a basis for runtime scheduling. Previous systems focus on only one of these capabilities. As a compiler IR, HPVM aims to enable effective code generation and optimization for heterogeneous systems. As a virtual ISA, it can be used to ship executable programs, achieving both functional portability and performance portability across such systems. At runtime, HPVM enables flexible scheduling policies, both through the graph structure and through the ability to compile individual nodes in a program to any of the target devices on a system. We have implemented a prototype HPVM system that defines the HPVM IR as an extension of the LLVM compiler IR and includes compiler optimizations that operate directly on HPVM graphs, together with code generators that translate the virtual ISA to NVIDIA GPUs, Intel's AVX vector units, and multicore x86-64 processors. Experimental results show that HPVM optimizations achieve significant performance improvements, that HPVM translators achieve performance competitive with manually developed OpenCL code for both GPUs and vector hardware, and that runtime scheduling policies can use both program and runtime information to exploit the flexible compilation capabilities. Overall, we conclude that the HPVM representation is a promising basis for achieving performance portability and for implementing parallelizing compilers for heterogeneous parallel systems.
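To make the idea of a hierarchical dataflow graph with per-node device selection more concrete, the following is a minimal Python sketch. It is not the HPVM IR, which the abstract describes as an extension of LLVM IR. The names `Leaf`, `Graph`, `topo_order`, and `schedule`, the device labels, and the "first listed target" policy are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple, Union


@dataclass
class Leaf:
    # Devices this node could be compiled for (hypothetical labels).
    targets: Tuple[str, ...]


@dataclass
class Graph:
    # An internal node: child nodes plus dataflow edges (producer, consumer).
    nodes: Dict[str, Union["Leaf", "Graph"]] = field(default_factory=dict)
    edges: List[Tuple[str, str]] = field(default_factory=list)


def topo_order(g: Graph) -> List[str]:
    """Return child names in an order that respects the dataflow edges."""
    indegree = {name: 0 for name in g.nodes}
    for _, dst in g.edges:
        indegree[dst] += 1
    ready = [name for name in g.nodes if indegree[name] == 0]
    order: List[str] = []
    while ready:
        name = ready.pop(0)
        order.append(name)
        for src, dst in g.edges:
            if src == name:
                indegree[dst] -= 1
                if indegree[dst] == 0:
                    ready.append(dst)
    if len(order) != len(g.nodes):
        raise ValueError("dataflow graph contains a cycle")
    return order


def schedule(g: Graph, pick: Callable[[Leaf], str], prefix: str = "") -> List[Tuple[str, str]]:
    """Walk the hierarchy and choose a device for every leaf node."""
    plan: List[Tuple[str, str]] = []
    for name in topo_order(g):
        node = g.nodes[name]
        path = prefix + name
        if isinstance(node, Graph):
            # Internal node: recurse into its child graph.
            plan.extend(schedule(node, pick, path + "/"))
        else:
            plan.append((path, pick(node)))
    return plan


# A two-level example: "blur" is itself a graph of two leaf nodes.
blur = Graph(
    nodes={"rows": Leaf(("gpu", "cpu")), "cols": Leaf(("gpu", "cpu"))},
    edges=[("rows", "cols")],
)
pipeline = Graph(
    nodes={"load": Leaf(("cpu",)), "blur": blur, "store": Leaf(("cpu",))},
    edges=[("load", "blur"), ("blur", "store")],
)

# Illustrative policy: take the first target each leaf lists.
plan = schedule(pipeline, lambda leaf: leaf.targets[0])
```

In this sketch the scheduling policy is just a function passed to `schedule`, so a different policy (for example, one that consults runtime load) could be swapped in without changing the graph, which mirrors the separation between program structure and device choice that the abstract describes.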
