Cache-Aware Kernel Tiling: An Approach for System-Level Performance Optimization of GPU-Based Applications

We present a software approach to the data latency problem in certain GPU applications. Each application is modeled as a kernel graph, in which nodes represent individual GPU kernels and edges capture data dependencies. Our technique exploits the GPU L2 cache to accelerate parameter passing between kernels. The key idea is that, instead of having each kernel process the entire input in one invocation, we subdivide the input into fragments that fit in the cache and, ideally, process each fragment in one continuous sequence of kernel invocations. The technique is oblivious to kernel functionality and requires minimal source code modification. We demonstrate it on a full-fledged image processing application and improve performance by 30% on average across various settings.
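The reordering described above can be illustrated with a minimal host-side sketch. It is not the authors' implementation: it assumes a linear chain of kernels rather than a general kernel graph, models each kernel as a pointwise Python function, and ignores cross-fragment dependencies (for example, stencil halos). The names `run_untiled`, `run_tiled`, `fragment_size`, and the three toy kernels are hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Sequence

# A "kernel" here is a stand-in for a GPU kernel: it maps a buffer to a buffer.
Kernel = Callable[[List[float]], List[float]]


def run_untiled(kernels: Sequence[Kernel], data: List[float]) -> List[float]:
    """Baseline order: each kernel processes the entire input in one invocation."""
    for kernel in kernels:
        data = kernel(data)
    return data


def run_tiled(kernels: Sequence[Kernel], data: List[float],
              fragment_size: int) -> List[float]:
    """Fragment-major order: push one fragment through every kernel before
    moving on, so intermediate results stay small (assumed to fit in cache)."""
    if fragment_size <= 0:
        raise ValueError("fragment_size must be positive")
    out: List[float] = []
    for start in range(0, len(data), fragment_size):
        fragment = data[start:start + fragment_size]
        for kernel in kernels:  # one continuous sequence of kernel invocations
            fragment = kernel(fragment)
        out.extend(fragment)
    return out


# Hypothetical pointwise kernels (no dependencies across elements).
def scale(xs: List[float]) -> List[float]:
    return [2.0 * x for x in xs]


def offset(xs: List[float]) -> List[float]:
    return [x + 1.0 for x in xs]


def square(xs: List[float]) -> List[float]:
    return [x * x for x in xs]


PIPELINE = [scale, offset, square]
```

Under these assumptions both schedules compute the same result; only the invocation order differs. On a real GPU, the fragment size would have to be chosen relative to the L2 cache capacity, and kernels with neighborhood dependencies would need additional handling that this sketch does not cover.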
