Automatically Migrating Sequential Applications to Heterogeneous System Architecture

Heterogeneous System Architecture (HSA) is a hardware architecture for heterogeneous computing proposed by the HSA Foundation. Its heterogeneous Uniform Memory Access (hUMA) enables data sharing between heterogeneous devices, and its user-level queuing model (hQ) enables low-overhead kernel launching. With these features, applications can perform heterogeneous computing more efficiently and effectively. However, most of today's heterogeneous-computing applications do not leverage hUMA and hQ, and the majority of applications on the market are still implemented in traditional sequential models. This paper presents a fully automatic framework that migrates sequential applications to HSA. The framework includes polyhedral-guided memory aliasing analysis, a staged dispatching predictor, and memory coalescing optimization, and it takes advantage of hUMA and hQ to achieve low-overhead job dispatching on HSA-compliant systems. On an HSA-compliant AMD Carrizo machine, a sequential application processed by our framework can run up to 8.66x faster than its original version. In several cases where workloads are considered insufficient to benefit from conventional or non-HSA heterogeneous computing, our framework still delivers significant speedups. In addition, the performance obtained through our framework sometimes exceeds the gain from manual tuning for both HSA and non-HSA targets on the same Carrizo machine. With this framework, many existing applications coded in traditional sequential models can gain a performance boost from HSA-based heterogeneous computing.
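
To make the hQ model concrete, the sketch below shows what user-level kernel dispatch looks like through the public HSA 1.x runtime API (hsa.h): a queue is created in user space, an AQL dispatch packet is written directly into it, and a doorbell signal launches the kernel without a system call, which is what makes small workloads viable. This is an illustrative sketch, not code from the paper's framework; the kernel object is a placeholder (a real handle comes from a finalized code object), the kernel takes no arguments, and error handling is omitted.

/* Minimal sketch of hQ-style user-level dispatch via the HSA runtime API.
 * Assumptions: an HSA-compliant system, a placeholder kernel object, and
 * no error checking. Under hUMA, kernel arguments and data buffers could
 * point directly at ordinary host memory. */
#include <hsa/hsa.h>
#include <stdint.h>
#include <string.h>

static hsa_status_t find_gpu(hsa_agent_t agent, void *data) {
    hsa_device_type_t type;
    hsa_agent_get_info(agent, HSA_AGENT_INFO_DEVICE, &type);
    if (type == HSA_DEVICE_TYPE_GPU) {
        *(hsa_agent_t *)data = agent;
        return HSA_STATUS_INFO_BREAK;   /* stop iterating: GPU agent found */
    }
    return HSA_STATUS_SUCCESS;
}

int main(void) {
    hsa_init();

    hsa_agent_t gpu;
    hsa_iterate_agents(find_gpu, &gpu);

    /* hQ: the queue lives in user space; no syscall is needed to dispatch. */
    hsa_queue_t *queue;
    hsa_queue_create(gpu, 256, HSA_QUEUE_TYPE_SINGLE,
                     NULL, NULL, UINT32_MAX, UINT32_MAX, &queue);

    hsa_signal_t done;
    hsa_signal_create(1, 0, NULL, &done);

    /* Reserve a packet slot and fill in an AQL kernel dispatch packet. */
    uint64_t index = hsa_queue_add_write_index_relaxed(queue, 1);
    hsa_kernel_dispatch_packet_t *pkt =
        (hsa_kernel_dispatch_packet_t *)queue->base_address +
        (index & (queue->size - 1));

    memset(pkt, 0, sizeof(*pkt));
    pkt->setup = 1 << HSA_KERNEL_DISPATCH_PACKET_SETUP_DIMENSIONS; /* 1-D */
    pkt->workgroup_size_x = 64;
    pkt->workgroup_size_y = 1;
    pkt->workgroup_size_z = 1;
    pkt->grid_size_x = 1024;
    pkt->grid_size_y = 1;
    pkt->grid_size_z = 1;
    pkt->kernel_object = 0;      /* placeholder: handle of a finalized kernel */
    pkt->kernarg_address = NULL; /* hUMA: kernargs may live in host memory */
    pkt->completion_signal = done;

    /* Publish the packet type last, then ring the user-level doorbell. */
    __atomic_store_n(&pkt->header,
                     (uint16_t)(HSA_PACKET_TYPE_KERNEL_DISPATCH
                                << HSA_PACKET_HEADER_TYPE),
                     __ATOMIC_RELEASE);
    hsa_signal_store_relaxed(queue->doorbell_signal, index);

    /* Block until the completion signal drops below 1. */
    hsa_signal_wait_acquire(done, HSA_SIGNAL_CONDITION_LT, 1,
                            UINT64_MAX, HSA_WAIT_STATE_BLOCKED);

    hsa_signal_destroy(done);
    hsa_queue_destroy(queue);
    hsa_shut_down();
    return 0;
}

The contrast with a conventional (non-HSA) accelerator path is that the dispatch itself is only a few user-space stores: there is no driver ioctl and, thanks to hUMA, no explicit host-to-device copy on the critical path.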
