Building-Blocks for Performance Oriented DSLs

Domain-specific languages raise the level of abstraction in software development. While it is evident that programmers can more easily reason about very high-level programs, the same holds for compilers only if the compiler has an accurate model of the application domain and the underlying target platform. Since mapping high-level, general-purpose languages to modern, heterogeneous hardware is becoming increasingly difficult, DSLs are an attractive way to capitalize on improved hardware performance, precisely by making the compiler reason at a higher level. Implementing efficient DSL compilers is a daunting task, however, and support for building performance-oriented DSLs is urgently needed. To this end, we present the Delite Framework, an extensible toolkit that drastically simplifies building embedded DSLs and compiling DSL programs for execution on heterogeneous hardware. We discuss several building blocks in some detail and present experimental results for the OptiML machine-learning DSL implemented on top of Delite.
