Look before You Leap: Using the Right Hardware Resources to Accelerate Applications

GPUs are widely used to accelerate data-parallel applications. However, while GPU processing capability improves with every generation, CPU computing power also keeps growing through additional cores and wider vector units. Compared to this rapid development of GPUs and CPUs, the bandwidth of data transfers between the GPU and the host CPU grows much more slowly, creating a data-transfer wall for GPU usage. In this situation, choosing the right mix of hardware resources - i.e., the right hardware configuration - is critical for application performance, and the right choice depends on the available hardware as well as on the application and the dataset it processes. In this paper, we present a systematic approach to determine the hardware configuration that leads to the best performance for a given workload. Our approach captures the variation in hardware capabilities and data-transfer overhead across applications and datasets, and uses modeling and prediction techniques to determine the best-performing hardware configuration. We have tested our approach on 7 applications with 6 datasets per application. The results show that our approach makes the correct decision in 38 out of 42 test cases, achieving up to 12.6×/6.6× performance improvement compared to an uninformed Only-CPU/Only-GPU decision.
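To make the selection problem concrete, the sketch below (in Python) shows one way such a decision could be made. It is a minimal, hypothetical cost model, assuming per-platform throughput and PCIe bandwidth figures that would be measured or predicted offline; the names, formulas, and numbers are illustrative assumptions, not the model proposed in the paper.

# A minimal sketch of choosing between Only-CPU, Only-GPU, and CPU+GPU.
# All quantities and the linear cost model are illustrative assumptions,
# not the authors' actual performance model.

from dataclasses import dataclass

@dataclass
class Workload:
    work_items: float         # total amount of parallel work (arbitrary units)
    bytes_to_transfer: float  # data that must cross the PCIe bus for GPU use

@dataclass
class Platform:
    cpu_throughput: float     # work items per second on the CPU (assumed, measured offline)
    gpu_throughput: float     # work items per second on the GPU (assumed, measured offline)
    pcie_bandwidth: float     # bytes per second between host and device (assumed)

def predict_times(w: Workload, p: Platform) -> dict:
    """Predict execution time for each hardware configuration.

    Only-CPU: pure compute time on the host.
    Only-GPU: compute time on the device plus the data-transfer cost.
    CPU+GPU:  work split statically so both devices finish at the same time;
              transfer cost is charged only for the GPU's share of the data.
    """
    cpu_cost = w.work_items / p.cpu_throughput
    gpu_cost = w.work_items / p.gpu_throughput + w.bytes_to_transfer / p.pcie_bandwidth

    # Static GPU fraction beta that equalizes both finish times:
    # beta * gpu_cost == (1 - beta) * cpu_cost
    beta = cpu_cost / (cpu_cost + gpu_cost)

    return {
        "Only-CPU": cpu_cost,
        "Only-GPU": gpu_cost,
        "CPU+GPU": (1 - beta) * cpu_cost,
    }

def choose_configuration(w: Workload, p: Platform) -> str:
    """Return the configuration with the smallest predicted execution time."""
    times = predict_times(w, p)
    return min(times, key=times.get)

if __name__ == "__main__":
    # Hypothetical workload and platform numbers, for illustration only.
    w = Workload(work_items=1e9, bytes_to_transfer=4e9)
    p = Platform(cpu_throughput=2e8, gpu_throughput=2e9, pcie_bandwidth=6e9)
    print(predict_times(w, p))
    print("Best configuration:", choose_configuration(w, p))

In this toy model the CPU+GPU option is a static partition that equalizes the predicted finish times of both devices; the paper's approach instead derives the relevant capability and data-transfer quantities per application and dataset through its own modeling and prediction techniques.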
