Accelerating Data Analytics on Integrated GPU Platforms via Runtime Specialization

Integrated GPU systems are a cost-effective and energy-efficient option for accelerating data-intensive applications. While these platforms reduce the overhead of offloading computation to the GPU and offer the potential for fine-grained resource scheduling, several open challenges remain: (1) the distinct execution models of the heterogeneous devices on such platforms drive the need to dynamically match workload characteristics to the underlying resources; (2) the complex architecture and programming models of such systems demand substantial application knowledge to achieve high performance; and (3) as such systems become prevalent, their utility must extend from well-understood, regular data-parallel applications to the broader set of input-dependent, irregular applications common in enterprise settings. The key contribution of our research is to enable runtime specialization on integrated GPU platforms by matching application characteristics to the underlying heterogeneous resources for both regular and irregular workloads. Our approach enables profile-driven resource management and optimization, providing high application performance and system throughput. Toward this end, this work proposes two novel schedulers with distinct goals: (a) a device-affinity, contention-aware scheduler that incorporates instrumentation-driven optimizations to improve throughput when running diverse applications on integrated CPU–GPU servers, and (b) a specialized, affinity-aware work-stealing scheduler that efficiently distributes work across all CPU and GPU cores within a single application, taking into account both application characteristics and architectural differences of the underlying devices.
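To make the affinity-aware work-stealing idea concrete, the following is a minimal C++17 sketch, not the dissertation's actual implementation: each task carries a device-affinity hint (for example, derived from profiling), CPU workers drain CPU-affine work first and steal GPU-affine tasks only when idle, and a single GPU proxy thread does the reverse. All names here (Task, AffinityPool, Device) are illustrative assumptions.

// Minimal sketch of affinity-first, steal-second scheduling across a CPU
// pool and a GPU proxy thread. Illustrative only; names are assumptions.
#include <atomic>
#include <deque>
#include <functional>
#include <iostream>
#include <mutex>
#include <optional>
#include <thread>
#include <vector>

enum class Device { CPU, GPU };

struct Task {
    Device affinity;             // preferred device, e.g. from profiling
    std::function<void()> body;  // the actual work
};

class AffinityPool {
public:
    void submit(Task t) {
        std::lock_guard<std::mutex> lk(mu_);
        ((t.affinity == Device::CPU) ? cpu_q_ : gpu_q_).push_back(std::move(t));
    }

    // Workers prefer their own device's queue and steal from the other when idle.
    std::optional<Task> take(Device self) {
        std::lock_guard<std::mutex> lk(mu_);
        auto& own   = (self == Device::CPU) ? cpu_q_ : gpu_q_;
        auto& other = (self == Device::CPU) ? gpu_q_ : cpu_q_;
        if (!own.empty())   { Task t = std::move(own.front());   own.pop_front();   return t; }
        if (!other.empty()) { Task t = std::move(other.front()); other.pop_front(); return t; } // steal
        return std::nullopt;
    }

private:
    std::mutex mu_;
    std::deque<Task> cpu_q_, gpu_q_;
};

int main() {
    AffinityPool pool;
    std::atomic<int> done{0};
    const int kTasks = 16;
    for (int i = 0; i < kTasks; ++i)
        pool.submit({i % 4 == 0 ? Device::GPU : Device::CPU,
                     [&done] { done.fetch_add(1); }});

    auto worker = [&](Device self) {
        while (done.load() < kTasks) {
            if (auto t = pool.take(self)) t->body();
            else std::this_thread::yield();
        }
    };
    std::thread gpu(worker, Device::GPU);  // stands in for the GPU proxy thread
    std::vector<std::thread> cpus;
    for (int i = 0; i < 3; ++i) cpus.emplace_back(worker, Device::CPU);
    gpu.join();
    for (auto& th : cpus) th.join();
    std::cout << "completed " << done.load() << " tasks\n";
}

A production scheduler would replace the single mutex-protected queues with per-worker lock-free deques (e.g., Chase–Lev) and would batch GPU-affine tasks into kernel launches, but the affinity-first, steal-second policy would remain the same.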
