FinePar: Irregularity-aware fine-grained workload partitioning on integrated architectures

Integrated architectures that feature a CPU and a GPU on the same die are an emerging and promising platform for fine-grained CPU-GPU collaboration. However, the integration also introduces several programming and system-optimization challenges, especially for irregular applications. The complex interplay between heterogeneity and irregularity leads to very low processor utilization when irregular applications run on integrated architectures, and fine-grained co-processing on the CPU and GPU remains an open problem. In particular, this paper shows that previous workload-partitioning approaches for CPU-GPU co-processing are far from ideal in terms of resource utilization and performance. To address this problem, we propose FinePar, a system software layer that accounts for the architectural differences between the CPU and GPU and leverages the fine-grained collaboration enabled by integrated architectures. Through irregularity-aware performance modeling and online auto-tuning, FinePar partitions irregular workloads to achieve both device-level and thread-level load balance. We evaluate FinePar with eight irregular applications on an AMD integrated architecture and compare it with state-of-the-art partitioning approaches. Results show that FinePar delivers better resource utilization and achieves an average 1.38X speedup over the optimal coarse-grained partitioning method.
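As a concrete illustration of the idea (a minimal sketch, not FinePar's actual implementation, which the abstract does not detail), the code below shows one way irregularity-aware fine-grained partitioning could look for sparse matrix-vector multiplication (SpMV), a representative irregular workload: CSR rows are split between the CPU and GPU by their per-row nonzero counts, with long, irregular rows routed to the CPU and short, regular rows to the GPU. The function name partition_rows and the cutoff nnz_threshold are hypothetical; in FinePar the cutoff would presumably be chosen by the performance model and online auto-tuner rather than fixed by hand.

```python
# Hypothetical sketch: fine-grained, irregularity-aware partitioning of
# SpMV rows between CPU and GPU. Rows whose nonzero counts exceed a
# tunable threshold are treated as "irregular" and routed to the CPU
# (which handles long, divergent rows well); the remaining regular rows
# go to the GPU (which favors uniform bulk work).

import numpy as np

def partition_rows(row_ptr, nnz_threshold):
    """Split CSR rows into CPU and GPU sets by per-row nonzero count.

    row_ptr: CSR row-pointer array of length n_rows + 1.
    nnz_threshold: rows with more nonzeros than this go to the CPU;
                   an auto-tuner would search over this cutoff.
    """
    nnz_per_row = np.diff(row_ptr)                      # nonzeros in each row
    cpu_rows = np.where(nnz_per_row > nnz_threshold)[0]  # long, irregular rows
    gpu_rows = np.where(nnz_per_row <= nnz_threshold)[0] # short, regular rows
    return cpu_rows, gpu_rows

# Toy CSR matrix: row_ptr for 5 rows holding 1, 8, 2, 1, and 20 nonzeros.
row_ptr = np.array([0, 1, 9, 11, 12, 32])
cpu_rows, gpu_rows = partition_rows(row_ptr, nnz_threshold=4)
print("CPU rows:", cpu_rows)  # -> [1 4]
print("GPU rows:", gpu_rows)  # -> [0 2 3]
```

Splitting at row granularity like this, rather than handing each device a contiguous block of the matrix, is what makes the partitioning fine-grained: both devices stay busy even when the nonzeros are heavily skewed toward a few rows.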
