Efficient High Performance Computing on Heterogeneous Platforms

Heterogeneous platforms are mixes of different processing units in a compute node (e.g., CPUs+GPUs, CPU+MICs) or a chip package (e.g., APUs). This type of platforms keeps gaining popularity in various computer systems ranging from supercomputers to mobile devices. In this context, improving their efficiency and usability has become increasingly important. In this thesis, we develop systematic methods for a large variety of data parallel applications to efficiently utilize heterogeneous platforms. Specifically, (1) we evaluate the suitability of OpenCL as a unified programming model for heterogeneous computing and improve OpenCL's efficiency for programming heterogenous platforms; (2) we develop a workload partitioning framework to accelerate imbalanced applications on heterogeneous platforms, where we match the heterogeneity of the platform with the imbalance of the workload; (3) we propose a model-based prediction method to correctly and quickly predict the optimal workload partitioning, maximizing the performance gain while speeding up the partitioning process; (4) we generalize a systematic workload partitioning approach which improves performance for both balanced and imbalanced applications, for applications with different datasets and execution scenarios, and for platforms with different hardware mixes; (5) we design an application analyzer that analyzes application kernel structures and enables different partitioning strategies accordingly to obtain both high performance and wide applicability for workload partitioning on heterogeneous platforms. To summarize, this thesis demonstrates that heterogeneous platforms are the right solution, performance-wise, for many classes of data parallel applications, and shows how high performance can be achieved systematically.

[1]  Jürgen Teich,et al.  Frameworks for Multi-core Architectures: A Comprehensive Evaluation Using 2D/3D Image Registration , 2011, ARCS.

[2]  Alejandro Duran,et al.  Optimizing the Exploitation of Multicore Processors and GPUs with OpenMP and OpenCL , 2010, LCPC.

[3]  Firas Hamze,et al.  A Performance Comparison of CUDA and OpenCL , 2010, ArXiv.

[4]  Alejandro Duran,et al.  Ompss: a Proposal for Programming Heterogeneous Multi-Core Architectures , 2011, Parallel Process. Lett..

[5]  Kim M. Hazelwood,et al.  Where is the data? Why you cannot debate CPU vs. GPU performance without the answer , 2011, (IEEE ISPASS) IEEE INTERNATIONAL SYMPOSIUM ON PERFORMANCE ANALYSIS OF SYSTEMS AND SOFTWARE.

[6]  Andrs Vajda Programming Many-Core Chips , 2011 .

[7]  Grigori Fursin,et al.  Predictive Runtime Code Scheduling for Heterogeneous Architectures , 2008, HiPEAC.

[8]  Bronis R. de Supinski,et al.  Heterogeneous Task Scheduling for Accelerated OpenMP , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.

[9]  Emmanuel Agullo,et al.  QR Factorization on a Multicore Node Enhanced with Multiple GPU Accelerators , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.

[10]  Collin McCurdy,et al.  The Scalable Heterogeneous Computing (SHOC) benchmark suite , 2010, GPGPU-3.

[11]  Maurice Steinman,et al.  AMD Fusion APU: Llano , 2012, IEEE Micro.

[12]  Jie Shen,et al.  Performance Gaps between OpenMP and OpenCL for Multi-core CPUs , 2012, 2012 41st International Conference on Parallel Processing Workshops.

[13]  Mikko H. Lipasti,et al.  Simulating cortical networks on heterogeneous multi-GPU systems , 2013, J. Parallel Distributed Comput..

[14]  Nancy M. Amato,et al.  Multithreaded Asynchronous Graph Traversal for In-Memory and Semi-External Memory , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.

[15]  Song Huang,et al.  On the energy efficiency of graphics processing units for scientific computing , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.

[16]  John E. Stone,et al.  OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems , 2010, Computing in Science & Engineering.

[17]  Zhengyu He,et al.  Dynamically tuned push-relabel algorithm for the maximum flow problem on CPU-GPU-Hybrid platforms , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).

[18]  Frank Mueller,et al.  Cross-Platform Performance Prediction of Parallel Applications Using Partial Execution , 2005, ACM/IEEE SC 2005 Conference (SC'05).

[19]  Matei Ripeanu,et al.  A yoke of oxen and a thousand chickens for heavy lifting graph processing , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).

[20]  Jie Shen,et al.  Glinda: a framework for accelerating imbalanced applications on heterogeneous platforms , 2013, CF '13.

[21]  Cédric Augonnet,et al.  StarPU: a unified platform for task scheduling on heterogeneous multicore architectures , 2011, Concurr. Comput. Pract. Exp..

[22]  Thomas Fahringer,et al.  An automatic input-sensitive approach for heterogeneous task partitioning , 2013, ICS '13.

[23]  Wen-mei W. Hwu,et al.  Optimization principles and application performance evaluation of a multithreaded GPU using CUDA , 2008, PPoPP.

[24]  Gagan Agrawal,et al.  Compiler and runtime support for enabling generalized reduction computations on heterogeneous parallel configurations , 2010, ICS '10.

[25]  Jie Shen,et al.  An application-centric evaluation of OpenCL on multi-core CPUs , 2013, Parallel Comput..

[26]  Wu-chun Feng,et al.  Architecture-Aware Mapping and Optimization on a 1600-Core GPU , 2011, 2011 IEEE 17th International Conference on Parallel and Distributed Systems.

[27]  Jie Shen,et al.  OpenCL vs. OpenMP: A Programmability Debate , 2012 .

[28]  Surendra Byna,et al.  Data-aware scheduling of legacy kernels on heterogeneous platforms with distributed memory , 2010, SPAA '10.

[29]  Satoshi Matsuoka,et al.  Power-aware dynamic task scheduling for heterogeneous accelerated clusters , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.

[30]  Zhiyuan Li,et al.  Automatic tiling of iterative stencil loops , 2004, TOPL.

[31]  Kai Lu,et al.  The TianHe-1A Supercomputer: Its Hardware and Software , 2011, Journal of Computer Science and Technology.

[32]  Wen-mei W. Hwu,et al.  Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing , 2012 .

[33]  Jie Shen,et al.  Performance Traps in OpenCL for CPUs , 2013, 2013 21st Euromicro International Conference on Parallel, Distributed, and Network-Based Processing.

[34]  William J. Dally,et al.  The GPU Computing Era , 2010, IEEE Micro.

[35]  Hendrikus G. Visser,et al.  A framework for simulation of aircraft flyover noise through a non-standard atmosphere , 2012 .

[36]  Jie Shen,et al.  Improving Application Performance by Efficiently Utilizing Heterogeneous Many-core Platforms , 2015, 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing.

[37]  Jie Cheng,et al.  Programming Massively Parallel Processors. A Hands-on Approach , 2010, Scalable Comput. Pract. Exp..

[38]  Wei Chen,et al.  GreenGPU: A Holistic Approach to Energy Efficiency in GPU-CPU Heterogeneous Architectures , 2012, 2012 41st International Conference on Parallel Processing.

[39]  Thomas R. Gross,et al.  A library for portable and composable data locality optimizations for NUMA systems , 2015, PPOPP.

[40]  Kevin Skadron,et al.  Dymaxion: Optimizing memory access patterns for heterogeneous systems , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[41]  Kevin Skadron,et al.  Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[42]  Marcelo Yuffe,et al.  A fully integrated multi-CPU, GPU and memory controller 32nm processor , 2011, 2011 IEEE International Solid-State Circuits Conference.

[43]  Olav Aanes Fagerlund Multi-core programming with OpenCL: performance and portability: OpenCL in a memory bound scenario , 2010 .

[44]  Samuel Williams,et al.  Roofline: an insightful visual performance model for multicore architectures , 2009, CACM.

[45]  P. J. Narayanan,et al.  Accelerating Large Graph Algorithms on the GPU Using CUDA , 2007, HiPC.

[46]  Michael F. P. O'Boyle,et al.  Portable mapping of data parallel programs to OpenCL for heterogeneous systems , 2013, Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).

[47]  Pradeep Dubey,et al.  Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU , 2010, ISCA.

[48]  Jianliang Xu,et al.  GPURoofline: A Model for Guiding Performance Optimizations on GPUs , 2012, Euro-Par.

[49]  Jie Shen,et al.  Accelerating Cost Aggregation for Real-Time Stereo Matching , 2012, 2012 IEEE 18th International Conference on Parallel and Distributed Systems.

[50]  A. Penders,et al.  Accelerating Graph Analysis with Heterogeneous Systems , 2012 .

[51]  J. Xu OpenCL – The Open Standard for Parallel Programming of Heterogeneous Systems , 2009 .

[52]  Kevin Skadron,et al.  Performance modeling and automatic ghost zone optimization for iterative stencil loops on GPUs , 2009, ICS.

[53]  Stephen L. Olivier,et al.  Comparison of OpenMP 3.0 and Other Task Parallel Frameworks on Unbalanced Task Graphs , 2010, International Journal of Parallel Programming.

[54]  Samuel Williams,et al.  The Landscape of Parallel Computing Research: A View from Berkeley , 2006 .

[55]  André Seznec,et al.  Performance upper bound analysis and optimization of SGEMM on Fermi and Kepler GPUs , 2013, Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).

[56]  Erik Lindholm,et al.  NVIDIA Tesla: A Unified Graphics and Computing Architecture , 2008, IEEE Micro.

[57]  Bixia Zheng,et al.  Twin Peaks: A Software Platform for Heterogeneous Computing on General-Purpose and Graphics Processors , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[58]  Michael F. P. O'Boyle,et al.  A Static Task Partitioning Approach for Heterogeneous Systems Using OpenCL , 2011, CC.

[59]  Gregory Diamos,et al.  Harmony: an execution model and runtime for heterogeneous many core systems , 2008, HPDC '08.

[60]  John D. Owens,et al.  GPU Computing , 2008, Proceedings of the IEEE.

[61]  Jie Shen,et al.  Matchmaking Applications and Partitioning Strategies for Efficient Execution on Heterogeneous Platforms , 2015, 2015 44th International Conference on Parallel Processing.

[62]  Kevin Skadron,et al.  Load balancing in a changing world: dealing with heterogeneity and performance variability , 2013, CF '13.

[63]  Mark D. Hill,et al.  Amdahl's Law in the Multicore Era , 2008, Computer.

[64]  Rudolf Eigenmann,et al.  OpenMP to GPGPU: a compiler framework for automatic translation and optimization , 2009, PPoPP '09.

[65]  James C. Hoe,et al.  Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs? , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[66]  Kevin Skadron,et al.  HotSpot: a compact thermal modeling methodology for early-stage VLSI design , 2006, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

[67]  Teresa H. Y. Meng,et al.  Merge: a programming model for heterogeneous multi-core systems , 2008, ASPLOS.

[68]  Hyesoon Kim,et al.  Qilin: Exploiting parallelism on heterogeneous multiprocessors with adaptive mapping , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[69]  Stephen A. Rizzi,et al.  Synthesis of Virtual Environments for Aircraft Community Noise Impact Studies , 2005 .

[70]  Jaejin Lee,et al.  Performance characterization of the NAS Parallel Benchmarks in OpenCL , 2011, 2011 IEEE International Symposium on Workload Characterization (IISWC).

[71]  David I. August,et al.  Automatic CPU-GPU communication management and optimization , 2011, PLDI '11.

[72]  Jack J. Dongarra,et al.  From CUDA to OpenCL: Towards a performance-portable solution for multi-platform GPU programming , 2012, Parallel Comput..

[73]  David A. Koufaty,et al.  Hyperthreading Technology in the Netburst Microarchitecture , 2003, IEEE Micro.

[74]  Jie Shen,et al.  Look before You Leap: Using the Right Hardware Resources to Accelerate Applications , 2014, 2014 IEEE Intl Conf on High Performance Computing and Communications, 2014 IEEE 6th Intl Symp on Cyberspace Safety and Security, 2014 IEEE 11th Intl Conf on Embedded Software and Syst (HPCC,CSS,ICESS).

[75]  James Reinders,et al.  Intel Xeon Phi Coprocessor High Performance Programming , 2013 .

[76]  Kunle Olukotun,et al.  Efficient Parallel Graph Exploration on Multi-Core CPU and GPU , 2011, 2011 International Conference on Parallel Architectures and Compilation Techniques.

[77]  Jack J. Dongarra,et al.  Enabling and scaling matrix computations on heterogeneous multi-core and multi-GPU systems , 2012, ICS '12.

[78]  George Varghese,et al.  A 22nm IA multi-CPU and GPU System-on-Chip , 2012, 2012 IEEE International Solid-State Circuits Conference.

[79]  Richard W. Vuduc,et al.  Tuned and wildly asynchronous stencil kernels for hybrid CPU/GPU systems , 2009, ICS.

[80]  Eduard Ayguadé,et al.  An Extension of the StarSs Programming Model for Platforms with Multiple GPUs , 2009, Euro-Par.

[81]  Akila Gothandaraman,et al.  Comparing Hardware Accelerators in Scientific Applications: A Case Study , 2011, IEEE Transactions on Parallel and Distributed Systems.

[82]  Michael F. P. O'Boyle,et al.  OpenCL Task Partitioning in the Presence of GPU Contention , 2013, LCPC.

[83]  Kiran Kumar Matam,et al.  Accelerating Sparse Matrix Vector Multiplication in Iterative Methods Using GPU , 2011, 2011 International Conference on Parallel Processing.

[84]  Margaret Martonosi,et al.  Reducing GPU offload latency via fine-grained CPU-GPU synchronization , 2013, 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA).

[85]  No License,et al.  Intel ® 64 and IA-32 Architectures Software Developer ’ s Manual Volume 3 A : System Programming Guide , Part 1 , 2006 .

[86]  Lars Bergstrom,et al.  Measuring NUMA effects with the STREAM benchmark , 2011, ArXiv.

[87]  Kevin Skadron,et al.  Dynamic Heterogeneous Scheduling Decisions Using Historical Runtime Data , 2011 .

[88]  Christoph Kessler,et al.  OpenCL for programming shared memory multicore CPUs , 2011, HiPEAC 2012.

[89]  A. Choudhary,et al.  Nu-minebench 2.0 , 2005 .

[90]  Yong Kim,et al.  The 12-Core POWER8™ Processor With 7.6 Tb/s IO Bandwidth, Integrated Voltage Regulation, and Resonant Clocking , 2015, IEEE Journal of Solid-State Circuits.

[91]  Jie Shen,et al.  Improving performance by matching imbalanced workloads with heterogeneous platforms , 2014, ICS '14.

[92]  Jack J. Dongarra,et al.  Towards dense linear algebra for hybrid GPU accelerated manycore systems , 2009, Parallel Comput..

[93]  Eduard Ayguadé,et al.  Self-Adaptive OmpSs Tasks in Heterogeneous Environments , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.

[94]  Murat Efe Guney,et al.  On the limits of GPU acceleration , 2010 .

[95]  Jason Maassen,et al.  Performance Models for CPU-GPU Data Transfers , 2014, 2014 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing.

[96]  Hyesoon Kim,et al.  An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness , 2009, ISCA '09.

[97]  Manolis Papadrakakis,et al.  A new era in scientific computing: Domain decomposition methods in hybrid CPU-GPU architectures , 2011 .

[98]  Marilyn Wolf High-Performance Embedded Computing: Applications in Cyber-Physical Systems and Mobile Computing , 2014 .

[99]  Sergei Gorlatch,et al.  Comparing programming models for medical imaging on multi‐core systems , 2011, Concurr. Comput. Pract. Exp..

[100]  Sungdae Cho,et al.  Design and Performance Evaluation of Image Processing Algorithms on GPUs , 2011, IEEE Transactions on Parallel and Distributed Systems.

[101]  Vijay Srinivasan,et al.  4.2 A 20nm 32-Core 64MB L3 cache SPARC M7 processor , 2015, 2015 IEEE International Solid-State Circuits Conference - (ISSCC) Digest of Technical Papers.

[102]  Jeffrey S. Vetter,et al.  A Survey of CPU-GPU Heterogeneous Computing Techniques , 2015, ACM Comput. Surv..

[103]  Jérémie Allard,et al.  Multi-GPU and Multi-CPU Parallelization for Interactive Physics Simulations , 2010, Euro-Par.

[104]  Jesús Labarta,et al.  A Framework for Performance Modeling and Prediction , 2002, ACM/IEEE SC 2002 Conference (SC'02).

[105]  Jie Shen,et al.  ELMO: A User-Friendly API to Enable Local Memory in OpenCL Kernels , 2013, 2013 21st Euromicro International Conference on Parallel, Distributed, and Network-Based Processing.

[106]  Guang R. Gao,et al.  Tile Percolation: An OpenMP Tile Aware Parallelization Technique for the Cyclops-64 Multicore Processor , 2009, Euro-Par.

[107]  Rainald Löhner,et al.  Running unstructured grid‐based CFD solvers on modern graphics hardware , 2009 .

[108]  Joseph JáJá,et al.  High Performance FFT Based Poisson Solver on a CPU-GPU Heterogeneous Platform , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.

[109]  Jianbin Fang,et al.  A Comprehensive Performance Comparison of CUDA and OpenCL , 2011, 2011 International Conference on Parallel Processing.

[110]  Cees T. A. M. de Laat,et al.  An Empirical Evaluation of GPGPU Performance Models , 2014, Euro-Par Workshops.

[111]  Martin Lilleeng Sætra,et al.  Graphics processing unit (GPU) programming strategies and trends in GPU computing , 2013, J. Parallel Distributed Comput..

[112]  Yang Xunren,et al.  Computational atmospheric acoustics , 1997 .

[113]  Qi Hu,et al.  Scalable fast multipole methods on distributed heterogeneous architectures , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[114]  Alejandro Duran,et al.  The Design of OpenMP Tasks , 2009, IEEE Transactions on Parallel and Distributed Systems.

[115]  Matei Ripeanu,et al.  On Graphs, GPUs, and Blind Dating: A Workload to Processor Matchmaking Quest , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.

[116]  Mateus Pelegrino,et al.  Techniques for designing GPGPU games , 2012, 2012 IEEE International Games Innovation Conference.

[117]  Jungwon Kim,et al.  Achieving a single compute device image in OpenCL for multiple GPUs , 2011, PPoPP '11.

[118]  Scott A. Mahlke,et al.  Transparent CPU-GPU collaboration for data-parallel kernels on heterogeneous systems , 2013, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques.

[119]  Wu-chun Feng,et al.  On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing , 2011, 2011 Symposium on Application Accelerators in High-Performance Computing.

[120]  Aaftab Munshi,et al.  The OpenCL specification , 2009, 2009 IEEE Hot Chips 21 Symposium (HCS).