Oncilla: A GAS runtime for efficient resource allocation and data movement in accelerated clusters

Accelerated and in-core implementations of Big Data applications typically require large amounts of host and accelerator memory as well as efficient mechanisms for transferring data to and from accelerators in heterogeneous clusters. Scheduling for heterogeneous CPU and GPU clusters has been investigated in depth in the high-performance computing (HPC) and cloud computing arenas, but there has been less emphasis on the management of the cluster resources required to schedule applications across multiple nodes and devices. Previous approaches to this resource management problem have relied either on low-performance software layers or on complex data movement techniques adapted from the HPC arena, which reduces performance and creates barriers to migrating applications to new heterogeneous cluster architectures. This work proposes a new system architecture for cluster resource allocation and data movement built around the concept of managed Global Address Spaces (GAS), i.e., dynamically aggregated memory regions that span multiple nodes. We propose a software layer called Oncilla that uses a simple runtime and API to take advantage of non-coherent hardware support for GAS. The Oncilla runtime is evaluated on two different high-performance networks using microkernels representative of the TPC-H data warehousing benchmark, and it reduces runtime by up to 81%, on average, when compared with standard disk-based data storage techniques. The Oncilla API is also evaluated with a simple breadth-first search (BFS) benchmark to demonstrate how existing applications can incorporate support for managed GAS.
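To make the managed-GAS programming model concrete, the following C sketch outlines how a host-side application might allocate an aggregated memory region, stage tiles of it into node-local memory for an accelerator to process, and then release it. The oncilla_* names and signatures are illustrative assumptions backed by local stubs so the example is self-contained; they are not the published Oncilla API, and a real runtime would back the region with memory aggregated across cluster nodes and move data over the high-performance network.

    /* Minimal sketch of a managed-GAS allocation and data-movement flow.
     * The oncilla_* calls are illustrative assumptions, not the real API;
     * local stubs stand in for the runtime so the example compiles. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    typedef struct {
        void  *base;   /* in a real GAS runtime this may span remote nodes */
        size_t size;
    } gas_region_t;

    /* Stub: reserve an aggregate memory region (locally, for illustration). */
    static gas_region_t *oncilla_alloc(size_t bytes) {
        gas_region_t *r = malloc(sizeof *r);
        if (!r) return NULL;
        r->base = calloc(1, bytes);
        r->size = bytes;
        if (!r->base) { free(r); return NULL; }
        return r;
    }

    /* Stub: copy a slice of the region into a node-local buffer. */
    static int oncilla_copy(void *dst, const gas_region_t *src,
                            size_t offset, size_t bytes) {
        if (offset + bytes > src->size) return -1;
        memcpy(dst, (const char *)src->base + offset, bytes);
        return 0;
    }

    /* Stub: release the aggregate region. */
    static void oncilla_free(gas_region_t *r) {
        if (r) { free(r->base); free(r); }
    }

    int main(void) {
        const size_t table_bytes = 64u << 20;  /* 64 MiB "table" partition */
        const size_t tile_bytes  =  4u << 20;  /* 4 MiB working tile */

        gas_region_t *table = oncilla_alloc(table_bytes);
        if (!table) { fprintf(stderr, "GAS allocation failed\n"); return 1; }

        void *tile = malloc(tile_bytes);
        if (!tile) { oncilla_free(table); return 1; }

        /* Stream the table one tile at a time into local memory; in an
         * accelerated cluster each staged tile would feed a GPU kernel. */
        for (size_t off = 0; off < table_bytes; off += tile_bytes) {
            if (oncilla_copy(tile, table, off, tile_bytes) != 0) break;
            /* ... launch accelerator kernel on the staged tile ... */
        }

        free(tile);
        oncilla_free(table);
        return 0;
    }

The point of the sketch is the division of labor: the application expresses allocation and data movement through a small API, while the runtime decides which nodes contribute memory and how data is transferred to the node hosting the accelerator.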
