Workload Partitioning for Accelerating Applications on Heterogeneous Platforms

Heterogeneous platforms composed of multi-core CPUs and different types of accelerators, such as GPUs and the Xeon Phi, are becoming popular for data-parallel applications. The heterogeneity of the hardware mix and the diversity of the applications pose significant challenges to exploiting such platforms. In this context, an effective workload partitioning between processing units is critical for improving application performance. This partitioning is a function of the hardware capabilities as well as the application and the dataset to be used. In this work, we present a systematic approach to solving the partitioning problem. Specifically, we use modeling, profiling, and prediction techniques to quickly and correctly predict the optimal workload partitioning and the right hardware configuration to use. Our approach effectively characterizes the platform heterogeneity, efficiently determines an accurate partitioning, and easily adapts to new platforms, different application types, and different datasets. Experimental evaluation on 13 applications shows that our approach delivers performance improvements of 1.2×–14.6× over single-processor execution, and accurate partitioning with, in most cases, a performance gap below 10 percent versus an oracle-based partitioning.
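
The idea of predicting a partitioning from profiled hardware capabilities can be illustrated with a back-of-the-envelope model. The sketch below is not the paper's actual model; it is a minimal, hypothetical example that assumes profiling yields a sustained throughput per device and that execution time scales linearly with the assigned share of work. The function name predict_partition and all numbers are illustrative assumptions.

```python
# Minimal sketch (not the paper's model) of predicting a CPU+GPU workload
# partition from profiled per-device throughputs. All names and numbers
# are illustrative assumptions.

def predict_partition(total_work, gpu_throughput, cpu_throughput,
                      transfer_rate=None):
    """Return the fraction of work to offload to the GPU so that both
    devices are predicted to finish at (roughly) the same time.

    total_work      -- problem size in work items (from the dataset)
    gpu_throughput  -- profiled GPU rate, work items / second
    cpu_throughput  -- profiled CPU rate, work items / second
    transfer_rate   -- optional CPU<->GPU transfer rate, work items / second,
                       charged only to the GPU share
    """
    # Effective GPU rate when compute and data transfer are serialized:
    # time per item = 1/R_gpu + 1/R_transfer, so the rate is the inverse.
    if transfer_rate:
        effective_gpu = 1.0 / (1.0 / gpu_throughput + 1.0 / transfer_rate)
    else:
        effective_gpu = gpu_throughput

    # Balance point: beta * W / R_gpu == (1 - beta) * W / R_cpu.
    # In this purely linear model the problem size W cancels out; it is
    # kept as a parameter because a more realistic model would use it
    # (e.g., to amortize fixed launch and transfer latencies).
    beta = effective_gpu / (effective_gpu + cpu_throughput)

    # When one device dominates, an "only-GPU" or "only-CPU" configuration
    # can beat partitioning; a simple threshold stands in for that
    # hardware-configuration decision here.
    if beta > 0.95:
        return 1.0   # run GPU-only
    if beta < 0.05:
        return 0.0   # run CPU-only
    return beta


# Example: 10M work items, GPU five times faster than the CPU.
beta = predict_partition(10_000_000, gpu_throughput=5e8, cpu_throughput=1e8)
print(f"GPU share: {beta:.2f}, CPU share: {1 - beta:.2f}")
```

In practice the interesting part, which the paper addresses, is obtaining trustworthy throughput estimates cheaply (via modeling and lightweight profiling) and deciding when a partitioned execution is worth it at all versus running on a single device.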
