Mapping parallel programs to heterogeneous CPU/GPU architectures using a Monte Carlo Tree Search

The single-core processor, which dominated for over 30 years, is now obsolete: recent trends are moving towards parallel systems, demanding a major shift in programming techniques and practices. We are rapidly approaching an age in which almost all programming will target parallel systems. Parallel hardware is itself evolving rapidly, with large heterogeneous systems, typically comprising a mixture of CPUs and GPUs, becoming mainstream. With this increasing heterogeneity comes increasing complexity: the programmer must not only decide where and how to express parallelism, but must also map the application efficiently onto the resources of the available system. This generally requires in-depth expert knowledge that most application programmers do not have. In this paper we describe a new technique that automatically derives optimal mappings for an application onto a heterogeneous architecture, using a Monte Carlo Tree Search (MCTS) algorithm. Our technique exploits high-level design patterns, targeting a set of well-specified parallel skeletons. On a convolution example, our MCTS-based technique obtains speedups within 5% of those achieved by a hand-tuned version of the same application.
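
To make the idea of searching for a mapping concrete, the following is a minimal sketch of a UCT-style Monte Carlo Tree Search that assigns each stage of a three-stage pipeline to either a CPU or a GPU. It is an illustration only, not the implementation described in the paper: the stage costs, the transfer penalty, the reward 1/cost, and all names (`STAGE_COST`, `TRANSFER_COST`, `mapping_cost`, `search`) are assumptions introduced here, and a real system would measure or model execution time rather than use a fixed table.

```python
import math
import random

# Hypothetical per-stage costs in arbitrary time units (not from the paper).
STAGE_COST = [
    {"CPU": 4.0, "GPU": 1.0},
    {"CPU": 1.0, "GPU": 3.0},
    {"CPU": 5.0, "GPU": 1.5},
]
TRANSFER_COST = 0.5  # assumed penalty when adjacent stages use different devices
DEVICES = ("CPU", "GPU")


def mapping_cost(mapping):
    """Toy cost model: sum of stage costs plus a penalty per device switch."""
    cost = sum(STAGE_COST[i][d] for i, d in enumerate(mapping))
    cost += TRANSFER_COST * sum(a != b for a, b in zip(mapping, mapping[1:]))
    return cost


class Node:
    """A partial mapping: the devices chosen for the first len(mapping) stages."""

    def __init__(self, mapping, parent=None):
        self.mapping = mapping
        self.parent = parent
        self.children = {}
        self.visits = 0
        self.total_reward = 0.0

    def is_terminal(self):
        return len(self.mapping) == len(STAGE_COST)

    def fully_expanded(self):
        return len(self.children) == len(DEVICES)


def uct_child(node, c=1.4):
    """Pick the child maximising mean reward plus the UCT exploration bonus."""
    def score(child):
        mean = child.total_reward / child.visits
        return mean + c * math.sqrt(math.log(node.visits) / child.visits)
    return max(node.children.values(), key=score)


def search(iterations=2000, seed=0):
    rng = random.Random(seed)
    root = Node(())
    best_mapping, best_cost = None, float("inf")
    for _ in range(iterations):
        node = root
        # Selection: descend while every device choice has been tried.
        while not node.is_terminal() and node.fully_expanded():
            node = uct_child(node)
        # Expansion: try one untried device for the next stage.
        if not node.is_terminal():
            device = rng.choice([d for d in DEVICES if d not in node.children])
            child = Node(node.mapping + (device,), parent=node)
            node.children[device] = child
            node = child
        # Simulation: complete the mapping at random and evaluate it.
        remaining = len(STAGE_COST) - len(node.mapping)
        mapping = node.mapping + tuple(rng.choice(DEVICES) for _ in range(remaining))
        cost = mapping_cost(mapping)
        if cost < best_cost:
            best_mapping, best_cost = mapping, cost
        # Backpropagation: lower cost yields a higher reward.
        reward = 1.0 / cost
        while node is not None:
            node.visits += 1
            node.total_reward += reward
            node = node.parent
    return best_mapping, best_cost


if __name__ == "__main__":
    print(search())
```

Under this toy cost table the search space has only eight complete mappings, so exhaustive enumeration would suffice; the tree search becomes worthwhile when the number of stages, devices, and skeleton parameters makes enumeration impractical.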
