Dynamic Task Scheduling Scheme for a GPGPU Programming Framework

The computational power and the physical memory size of a single GPU device are often insufficient for large-scale problems. Using CUDA, the user must explicitly partition such problems into several tasks, repeating data transfer and kernel execution for each. Using multiple GPUs additionally requires explicit device switching. Furthermore, low-level hand optimizations, such as load balancing and choosing task granularity, are required to achieve high performance. To handle large-scale problems without any additional user code, we introduce an implicit dynamic task scheduling scheme into our CUDA variant, MESI-CUDA. MESI-CUDA is designed to abstract away low-level GPU features: virtual shared variables and logical thread mapping hide the complex memory hierarchy and physical device characteristics. On the other hand, explicit parallel execution via kernel functions remains the same as in CUDA. In our scheme, each kernel invocation in the user code is translated into a job submission to the runtime scheduler. The scheduler partitions each job into tasks that fit the device memory and dynamically schedules them onto the available GPU devices. Thus the user can simply write kernel invocations independently of the execution environment. The evaluation results show that our scheme can automatically utilize heterogeneous GPU devices with small overhead.
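To make the scheduling idea concrete, the sketch below shows, in plain CUDA, how one logical kernel invocation could be partitioned into memory-sized tasks and dispatched round-robin over all visible GPUs. The vecAdd kernel, the submitVecAddJob function, and the chunking heuristic are illustrative assumptions for this sketch, not the actual MESI-CUDA runtime or its API.

#include <cuda_runtime.h>
#include <algorithm>

// Illustrative kernel: element-wise vector addition.
__global__ void vecAdd(const float* a, const float* b, float* c, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Hypothetical "job submission": partition one logical launch into tasks
// whose working sets fit into each device's free memory, then dispatch
// the tasks round-robin over all visible GPUs.
void submitVecAddJob(const float* hA, const float* hB, float* hC, size_t n) {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    if (deviceCount == 0) return;

    size_t offset = 0;
    int dev = 0;
    while (offset < n) {
        cudaSetDevice(dev);

        // Size the next task so its three buffers fit in free device memory.
        size_t freeMem = 0, totalMem = 0;
        cudaMemGetInfo(&freeMem, &totalMem);
        size_t maxElems = (freeMem / 2) / (3 * sizeof(float)); // keep headroom
        size_t chunk = std::min(maxElems, n - offset);
        chunk = std::max<size_t>(chunk, 1); // illustrative guard against stalling

        float *dA, *dB, *dC;
        cudaMalloc(&dA, chunk * sizeof(float));
        cudaMalloc(&dB, chunk * sizeof(float));
        cudaMalloc(&dC, chunk * sizeof(float));

        cudaMemcpy(dA, hA + offset, chunk * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB + offset, chunk * sizeof(float), cudaMemcpyHostToDevice);

        int threads = 256;
        int blocks = (int)((chunk + threads - 1) / threads);
        vecAdd<<<blocks, threads>>>(dA, dB, dC, chunk);

        cudaMemcpy(hC + offset, dC, chunk * sizeof(float), cudaMemcpyDeviceToHost);

        cudaFree(dA); cudaFree(dB); cudaFree(dC);

        offset += chunk;
        dev = (dev + 1) % deviceCount; // naive round-robin over devices
    }
}

In MESI-CUDA, the user would write only the kernel invocation on virtual shared variables; the partitioning, data transfer, and device selection shown explicitly above are what the proposed runtime scheduler performs implicitly.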
