Adaptive Off-Line Tuning for Optimized Composition of Components for Heterogeneous Many-Core Systems

In recent years heterogeneous multi-core systems have been given much attention. However, performance optimization on these platforms remains a big challenge. Optimizations performed by compilers are often limited due to lack of dynamic information and run time environment, which makes applications often not performance portable. One current approach is to provide multiple implementations for the same interface that could be used interchangeably depending on the call context, and expose the composition choices to a compiler, deployment-time composition tool and/or run-time system. Using off-line machine-learning techniques allows to improve the precision and reduce the run-time overhead of run-time composition and leads to an improvement of performance portability. In this work we extend the run-time composition mechanism in the PEPPHER composition tool by off-line composition and present an adaptive machine learning algorithm for generating compact and efficient dispatch data structures with low training time. As dispatch data structure we propose an adaptive decision tree structure, which implies an adaptive training algorithm that allows to control the trade-off between training time, dispatch precision and run-time dispatch overhead.

[1]  Cédric Augonnet,et al.  Automatic Calibration of Performance Models on Heterogeneous Multicore Architectures , 2009, Euro-Par Workshops.

[2]  Markus Püschel,et al.  Offline library adaptation using automatically generated heuristics , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).

[3]  Cédric Augonnet,et al.  PEPPHER: Efficient and Productive Usage of Hybrid Computing Systems , 2011, IEEE Micro.

[4]  Manuela M. Veloso,et al.  Learning to Predict Performance from Formula Modeling and Training Data , 2000, ICML.

[5]  Yuefan Deng,et al.  New trends in high performance computing , 2001, Parallel Computing.

[6]  Michael F. P. O'Boyle,et al.  Reducing Training Time in a One-Shot Machine Learning-Based Compiler , 2009, LCPC.

[7]  Christoph W. Kessler,et al.  The PEPPHER Composition Tool: Performance-Aware Dynamic Composition of Applications for GPU-Based Systems , 2012, 2012 SC Companion: High Performance Computing, Networking Storage and Analysis.

[8]  J. Ross Quinlan,et al.  C4.5: Programs for Machine Learning , 1992 .

[9]  Christoph W. Kessler,et al.  Comparing Machine Learning Approaches for Context-Aware Composition , 2011, SC@TOOLS.

[10]  Nancy M. Amato,et al.  A framework for adaptive algorithm selection in STAPL , 2005, PPoPP.

[11]  Michael F. P. O'Boyle,et al.  A Static Task Partitioning Approach for Heterogeneous Systems Using OpenCL , 2011, CC.

[12]  Cédric Augonnet,et al.  StarPU: a unified platform for task scheduling on heterogeneous multicore architectures , 2011, Concurr. Comput. Pract. Exp..

[13]  Michael F. P. O'Boyle,et al.  Mapping parallelism to multi-cores: a machine learning based approach , 2009, PPoPP '09.

[14]  Steven G. Johnson,et al.  FFTW: an adaptive software architecture for the FFT , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[15]  Christoph W. Kessler,et al.  Optimized composition of performance‐aware parallel components , 2012, Concurr. Comput. Pract. Exp..

[16]  Alan Edelman,et al.  PetaBricks: a language and compiler for algorithmic choice , 2009, PLDI '09.

[17]  Greg Stitt,et al.  Elastic computing: a framework for transparent, portable, and adaptive multi-core heterogeneous computing , 2010, LCTES '10.

[18]  Christoph W. Kessler,et al.  A Framework for Performance-Aware Composition of Explicitly Parallel Components , 2007, PARCO.

[19]  Franz Franchetti,et al.  SPIRAL: Code Generation for DSP Transforms , 2005, Proceedings of the IEEE.

[20]  Michael Alexander,et al.  Euro-Par 2009 – Parallel Processing Workshops: HPPC, HeteroPar, PROPER, ROIA, UNICORE, VHPC, Delft, The Netherlands, August 25-28, 2009, Revised Selected Papers , 2010, Euro-Par Workshops.

[21]  Sameer Kulkarni,et al.  An evaluation of different modeling techniques for iterative compilation , 2011, 2011 Proceedings of the 14th International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES).

[22]  Xiaoming Li,et al.  Optimizing Matrix Multiplication with a Classifier Learning System , 2005, LCPC.

[23]  Lawrence Rauchwerger,et al.  An Adaptive Algorithm Selection Framework for Reduction Parallelization , 2006, IEEE Transactions on Parallel and Distributed Systems.

[24]  David A. Padua,et al.  A dynamically tuned sorting library , 2004, International Symposium on Code Generation and Optimization, 2004. CGO 2004..

[25]  Jack J. Dongarra,et al.  Automated empirical optimizations of software and the ATLAS project , 2001, Parallel Comput..

[26]  Manuela M. Veloso,et al.  Learning to Construct Fast Signal Processing Implementations , 2002, J. Mach. Learn. Res..

[27]  Takahiro Katagiri,et al.  ABCLibScript: a directive to support specification of an auto-tuning facility for numerical software , 2006, Parallel Comput..