Automatic Performance Tuning of Pipeline Patterns for Heterogeneous Parallel Architectures

Heterogeneous parallel architectures that combine conventional multicore CPUs with GPUs and other types of accelerators promise significant performance gains compared to homogeneous systems. However, exploiting the full potential of such systems is becoming increasingly challenging, often forcing programmers to combine different programming models and parallelization strategies. A promising approach to coping with the increased programming complexity is the use of parallel patterns for expressing certain types of computations at a high level of abstraction, while relying on the compiler and runtime system to map such patterns onto a heterogeneous system. In this paper, we present an approach for automatic performance tuning of high-level pipeline patterns for heterogeneous parallel systems in the context of a task-parallel, component-based programming model. Our approach attempts to automatically determine the best combination of pattern-specific parameters, parameters exposed by the runtime system, and machine-specific parameters such that execution is optimized for a given workload and target architecture. Experimental results on two state-of-the-art heterogeneous systems demonstrate the effectiveness of our
