Spiral in Scala: towards the systematic construction of generators for performance libraries

Program generators for high-performance libraries are an appealing solution to the recurring problem of porting and optimizing code for every new processor generation, but only a few such generators exist to date. This is due not only to the difficulty of the design, but also to that of the actual implementation, which often results in an ad hoc collection of standalone programs and scripts that are hard to extend, maintain, or reuse. In this paper we ask whether programming language concepts and features are needed to enable a more systematic construction of such generators and, if so, which ones. The systematic approach we advocate extrapolates from existing generators: a) describing the problem and algorithmic knowledge using one or several domain-specific languages (DSLs), b) expressing optimizations and choices as rewrite rules on DSL programs, c) designing data structures that can be configured to control the type of code that is generated and the data representation used, and d) using autotuning to select the best-performing alternative. As a case study, we implement a small but representative subset of Spiral in Scala using the Lightweight Modular Staging (LMS) framework. The first main contribution of this paper is the realization of c) using type classes to abstract over staging decisions, i.e., which pieces of a computation are performed immediately and for which pieces code is generated. Specifically, we abstract over different complex data representations jointly with different code representations, including generating loops versus unrolled code with scalar replacement, a crucial and usually tedious performance transformation. The second main contribution is to provide full support for a) and d) within the LMS framework: we extend LMS to support translation between different DSLs and autotuning through search.
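
The idea of using type classes to abstract over staging decisions can be illustrated with a minimal sketch in plain Scala. This is not the LMS API and not the paper's implementation: `StagingSketch`, `Code`, `Arith`, and `dot` are hypothetical names chosen for illustration, and generated code is modeled simply as a string. The kernel `dot` is written once; the type class instance selected by the element type decides whether the arithmetic is performed immediately or emitted as code.

```scala
// Illustrative sketch only: plain Scala, hypothetical names, no LMS dependency.
object StagingSketch {
  // A "staged" value: a fragment of code to be emitted, not a number.
  final case class Code(expr: String)

  // Type class listing the arithmetic a kernel needs,
  // independent of whether it runs now or is generated.
  trait Arith[T] {
    def plus(a: T, b: T): T
    def times(a: T, b: T): T
  }

  object Arith {
    // Instance 1: perform the computation immediately.
    implicit val now: Arith[Double] = new Arith[Double] {
      def plus(a: Double, b: Double): Double = a + b
      def times(a: Double, b: Double): Double = a * b
    }

    // Instance 2: generate code for the computation.
    implicit val later: Arith[Code] = new Arith[Code] {
      def plus(a: Code, b: Code): Code = Code(s"(${a.expr} + ${b.expr})")
      def times(a: Code, b: Code): Code = Code(s"(${a.expr} * ${b.expr})")
    }
  }

  // Dot product of two non-empty, equal-length sequences, written once.
  // The zip/map/reduce runs at generation time, so with T = Code
  // the emitted expression is fully unrolled.
  def dot[T](xs: Seq[T], ys: Seq[T])(implicit ops: Arith[T]): T =
    xs.zip(ys).map { case (x, y) => ops.times(x, y) }.reduce(ops.plus)
}
```

Under these assumptions, `StagingSketch.dot(Seq(1.0, 2.0), Seq(3.0, 4.0))` evaluates directly to `11.0`, while calling the same function on `Seq(Code("x0"), Code("x1"))` and `Seq(Code("y0"), Code("y1"))` yields `Code("((x0 * y0) + (x1 * y1))")`. A real staging framework would build a typed intermediate representation rather than strings, but the division of labor between the generic kernel and the instances is the same.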
