Evaluating Multiple Streams on Heterogeneous Platforms

Using multiple streams can improve overall system performance by mitigating the data transfer overhead on heterogeneous systems. Prior work has focused largely on GPUs, and little is known about the performance impact on the Intel Xeon Phi. In this work, we apply multiple streams to six real-world applications on Phi and systematically evaluate the performance benefits. The evaluation is performed at two levels: microbenchmarks and real-world applications. Our microbenchmark results show that data transfers and kernel execution can be overlapped on Phi, whereas data transfers in the two directions are serialized. At the real-world application level, we show that both overlappable and non-overlappable applications can benefit from using multiple streams (with a performance improvement of up to 24%). We also quantify how task granularity and resource granularity impact the overall performance. Finally, we present a...
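
The overlap behavior described in the abstract can be illustrated with a toy timing model. The sketch below is not the authors' model or any Phi runtime API. It assumes one transfer link that serializes host-to-device and device-to-host copies, one compute engine that can run concurrently with the link, equal-sized chunks (one per stream), and a link that issues all input copies before any output copy. The function name `pipelined_time` and its parameters are hypothetical.

```python
def pipelined_time(h2d, kernel, d2h, n_streams):
    """Estimate the makespan when the work is split into n_streams equal chunks.

    h2d, kernel, d2h: total time of each phase when run as a single task.
    Assumptions (illustrative only): transfers in both directions share one
    serialized link; kernels run on a separate engine and may overlap transfers.
    """
    h, k, d = h2d / n_streams, kernel / n_streams, d2h / n_streams
    link_free = 0.0      # time at which the transfer link becomes idle
    compute_free = 0.0   # time at which the compute engine becomes idle
    kernel_done = []

    # Input copies are issued back to back; each kernel starts once its
    # input has arrived and the previous kernel has finished.
    for _ in range(n_streams):
        link_free += h
        start = max(link_free, compute_free)
        compute_free = start + k
        kernel_done.append(compute_free)

    # Output copies use the same link, so they wait for both the link
    # and their own kernel.
    for done in kernel_done:
        link_free = max(link_free, done) + d
    return link_free
```

Under these assumptions, three phases of 4 time units each take 12 units with one stream and 8 units with four streams: the kernels hide behind the transfers, but the two transfer directions still add up because they share the link. Varying `n_streams` is a crude stand-in for task granularity; the model ignores per-task launch overhead, which in practice limits how fine the chunks can usefully be.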
