Understanding Co-Running Behaviors on Integrated CPU/GPU Architectures

Architecture designers tend to integrate both CPUs and GPUs on the same chip to deliver energy-efficient designs. It is still an open problem to effectively leverage the advantages of both CPUs and GPUs on integrated architectures. In this work, we port 42 programs in Rodinia, Parboil, and Polybench benchmark suites and analyze the co-running behaviors of these programs on both AMD and Intel integrated architectures. We find that co-running performance is not always better than running the program only with CPUs or GPUs. Among these programs, only eight programs can benefit from the co-running, while 24 programs only using GPUs and seven programs only using CPUs achieve the best performance. The remaining three programs show little performance preference for different devices. Through extensive workload characterization analysis, we find that architecture differences between CPUs and GPUs and limited shared memory bandwidth are two main factors affecting current co-running performance. Since not all the programs can benefit from integrated architectures, we build an automatic decision-tree-based model to help application developers predict the co-running performance for a given CPU-only or GPU-only program. Results show that our model correctly predicts 14 programs out of 15 for evaluated programs. For a co-run friendly program, we further propose a profiling-based method to predict the optimal workload partition ratio between CPUs and GPUs. Results show that our model can achieve 87.7 percent of the optimal performance relative to the best partition. The co-running programs acquired with our method outperform the original CPU-only and GPU-only programs by 34.5 and 20.9 percent respectively.

[1]  Mike O'Connor,et al.  Characterizing and evaluating a key-value store application on heterogeneous CPU-GPU systems , 2012, 2012 IEEE International Symposium on Performance Analysis of Systems & Software.

[2]  Huazhong Yang,et al.  Simultaneous Accelerator Parallelization and Point-to-Point Interconnect Insertion for Bus-Based Embedded SoCs , 2015 .

[3]  Fei He,et al.  Deadlock detection in FPGA design: A practical approach , 2015 .

[4]  Tarek S. Abdelrahman,et al.  Parallel Radix Sort on the AMD Fusion Accelerated Processing Unit , 2013, 2013 42nd International Conference on Parallel Processing.

[5]  Bingsheng He,et al.  In-Cache Query Co-Processing on Coupled CPU-GPU Architectures , 2014, Proc. VLDB Endow..

[6]  Maurice Steinman,et al.  AMD'S "LLANO" Fusion APU , 2011, 2011 IEEE Hot Chips 23 Symposium (HCS).

[7]  Wu-chun Feng,et al.  Performance characterization of data-intensive kernels on AMD Fusion architectures , 2012, Computer Science - Research and Development.

[8]  David Kaeli,et al.  Heterogeneous Computing with OpenCL 2.0 , 2015 .

[9]  John E. Stone,et al.  OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems , 2010, Computing in Science & Engineering.

[10]  Rajkishore Barik,et al.  Efficient Mapping of Irregular C++ Applications to Integrated GPUs , 2014, CGO '14.

[11]  Wen-mei W. Hwu,et al.  Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing , 2012 .

[12]  Michael F. P. O'Boyle,et al.  Portable mapping of data parallel programs to OpenCL for heterogeneous systems , 2013, Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).

[13]  Mayank Daga,et al.  Exploiting Coarse-Grained Parallelism in B+ Tree Searches on an APU , 2012, 2012 SC Companion: High Performance Computing, Networking Storage and Analysis.

[14]  Keshav Pingali,et al.  Adaptive heterogeneous scheduling for integrated GPUs , 2014, 2014 23rd International Conference on Parallel Architecture and Compilation (PACT).

[15]  Dominik Grewe,et al.  Mapping parallel programs to heterogeneous multi-core systems , 2014 .

[16]  Ben Sander,et al.  Applying AMD's Kaveri APU for heterogeneous computing , 2014, 2014 IEEE Hot Chips 26 Symposium (HCS).

[17]  Wu-chun Feng,et al.  On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing , 2011, 2011 Symposium on Application Accelerators in High-Performance Computing.

[18]  Zhen Lin,et al.  Implementation and evaluation of deep neural networks (DNN) on mainstream heterogeneous systems , 2014, APSys.

[19]  Alberto Maria Segre,et al.  Programs for Machine Learning , 1994 .

[20]  Samuel Williams,et al.  Roofline: an insightful visual performance model for multicore architectures , 2009, CACM.

[21]  David A. Wood,et al.  Heterogeneous system coherence for integrated CPU-GPU systems , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[22]  Bingsheng He,et al.  Revisiting Co-Processing for Hash Joins on the Coupled CPU-GPU Architecture , 2013, Proc. VLDB Endow..

[23]  Mitesh R. Meswani,et al.  Efficient breadth-first search on a heterogeneous processor , 2014, 2014 IEEE International Conference on Big Data (Big Data).

[24]  Kevin Skadron,et al.  Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[25]  Michael F. P. O'Boyle,et al.  A Static Task Partitioning Approach for Heterogeneous Systems Using OpenCL , 2011, CC.

[26]  Gagan Agrawal,et al.  Accelerating MapReduce on a coupled CPU-GPU architecture , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[27]  Henri Calandra,et al.  Hybrid strategy for stencil computations on the APU , 2014 .

[28]  Wenguang Chen,et al.  To Co-run, or Not to Co-run: A Performance Study on Integrated Architectures , 2015, 2015 IEEE 23rd International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems.

[29]  Dong Li,et al.  The tradeoffs of fused memory hierarchies in heterogeneous computing architectures , 2012, CF '12.

[30]  Parimala Thulasiraman,et al.  Designing APU Oriented Scientific Computing Applications in OpenCL , 2011, 2011 IEEE International Conference on High Performance Computing and Communications.

[31]  Bingsheng He,et al.  OmniDB: Towards Portable and Efficient Query Processing on Parallel CPU/GPU Architectures , 2013, Proc. VLDB Endow..

[32]  J. Ross Quinlan,et al.  C4.5: Programs for Machine Learning , 1992 .

[33]  Li Shen,et al.  Understanding Co-run Degradations on Integrated Heterogeneous Processors , 2014, LCPC.

[34]  Lifan Xu,et al.  Auto-tuning a high-level language targeted to GPU codes , 2012, 2012 Innovative Parallel Computing (InPar).