PALMOS: A Transparent, Multi-tasking Acceleration Layer for Parallel Heterogeneous Systems

Accelerators, such as Graphics Processing Units (GPUs), are increasingly common components of modern parallel systems. This move towards heterogeneity, however, has not propagated through all layers of system software: there is no transparent Operating System (OS) support for managing and sharing accelerators between users and applications, and consequently no support for OS-level virtualization (containers) targeting heterogeneous software. This paper presents a secure, user-space virtualization layer that integrates a system's accelerator resources with the standard multi-tasking and user-space virtualization facilities of a commodity Linux OS. It targets heterogeneous commodity systems found in data center nodes and requires no modifications to the OS, the OpenCL runtime, or applications. It eliminates high setup overheads, enables fine-grained sharing of mixed-vendor accelerator resources, and provides resource- and platform-aware scheduling. The average throughput improvement across workloads and mixed-vendor platform configurations ranges from 1.29x to 3.87x over existing schemes, outperforming both vendor accelerator-sharing facilities and message-passing solutions.
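To make the "resource- and platform-aware scheduling" idea concrete, the following is a minimal sketch of load-balanced kernel placement across mixed-vendor devices. It is illustrative only, not PALMOS's actual algorithm; the `Device`, `pick_device`, and `submit` names and the load metric (outstanding kernels per compute unit) are assumptions for this example.

```python
from dataclasses import dataclass

@dataclass
class Device:
    name: str
    vendor: str
    compute_units: int
    queued: int = 0  # outstanding kernels on this device

def pick_device(devices, prefer_vendor=None):
    """Pick the device with the lowest load per compute unit,
    breaking ties toward a preferred vendor when one is given."""
    def score(d):
        load = d.queued / d.compute_units
        vendor_penalty = 0 if prefer_vendor in (None, d.vendor) else 1
        return (load, vendor_penalty)
    return min(devices, key=score)

def submit(devices, n_kernels):
    """Place n_kernels one at a time, updating per-device load."""
    placements = []
    for _ in range(n_kernels):
        d = pick_device(devices)
        d.queued += 1
        placements.append(d.name)
    return placements

# Example: a 20-CU device absorbs more kernels than a 10-CU one.
devs = [Device("gpu0", "vendorA", 20), Device("gpu1", "vendorB", 10)]
print(submit(devs, 3))  # → ['gpu0', 'gpu1', 'gpu0']
```

The point of the sketch is that a user-space layer sitting between applications and the vendor runtimes can make such placement decisions globally, across all tenants, which no single vendor runtime can do on its own.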
