Automatic Task-Based Code Generation for High Performance Domain Specific Embedded Language

Providing high-level tools for parallel programming while sustaining a high level of performance is a challenge that techniques such as Domain Specific Embedded Languages (DSELs) try to solve. In previous work, we investigated the design of such a DSEL, NT$^2$, which provides a Matlab-like syntax for parallel numerical computations inside a C++ library. In this paper, we show how NT$^2$ has been redesigned for shared-memory systems in an extensible and portable way. The new NT$^2$ design relies on a tiered Parallel Skeleton system built using asynchronous task management and automatic compile-time taskification of user-level code. We describe how this system can operate on top of various shared-memory runtimes and evaluate the design using two benchmarks implementing linear algebra algorithms.
