SPIRAL: Extreme Performance Portability

In this paper, we address the question of how to automatically map computational kernels to highly efficient code for a wide range of computing platforms and how to establish the correctness of the synthesized code. More specifically, we focus on two fundamental problems that software developers face: performance portability across the ever-changing landscape of parallel platforms, and correctness guarantees for sophisticated floating-point code. We approach the problem as follows. First, we develop a formal framework that captures computational algorithms, computing platforms, and the program transformations of interest in a unifying mathematical formalism we call operator language (OL). Then we cast the synthesis of highly optimized computational kernels for a given machine as a strongly constrained optimization problem, which is solved by search combined with a multistage rewriting system. Since every rewrite step is semantics preserving, our approach establishes equivalence between the kernel specification and the synthesized program. The approach is implemented in the SPIRAL system, and we demonstrate it on a selection of computational kernels from signal and image processing, software-defined radio, and robotic vehicle control. Our target platforms range from mobile devices, desktops, and server multicore processors to large-scale high-performance and supercomputing systems, and we demonstrate performance comparable to expertly hand-tuned code across kernels and platforms.
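The idea of synthesis by search over semantics-preserving rewrites can be illustrated with a small Python sketch. It is not SPIRAL's implementation or its OL syntax. It assumes a single rewrite rule (the Cooley-Tukey factorization of the DFT), a toy multiplication-count cost model standing in for platform-aware search, and a numerical comparison standing in for a formal equivalence argument. All names (`dft_definition`, `expand`, `cost`, `search`, `apply_tree`) are hypothetical.

```python
import cmath

def dft_definition(x):
    """Kernel specification: the DFT computed directly from its definition."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def expand(n):
    """Enumerate every rule tree reachable by rewriting DFT_n.

    ("base", n) leaves DFT_n unexpanded; ("ct", k, m, left, right)
    rewrites DFT_{k*m} with the Cooley-Tukey rule and recurses.
    """
    yield ("base", n)
    for k in range(2, n):
        if n % k == 0:
            m = n // k
            for left in expand(k):
                for right in expand(m):
                    yield ("ct", k, m, left, right)

def cost(tree):
    """Toy cost model (assumption): count complex multiplications."""
    if tree[0] == "base":
        return tree[1] ** 2
    _, k, m, left, right = tree
    return k * cost(right) + k * m + m * cost(left)

def search(n):
    """Pick the cheapest rule tree under the toy cost model."""
    return min(expand(n), key=cost)

def apply_tree(tree, x):
    """Execute a rule tree on input x.

    Cooley-Tukey: DFT_{km} = (DFT_k kron I_m) T (I_k kron DFT_m) L,
    where L is a stride permutation and T a diagonal of twiddle factors.
    """
    if tree[0] == "base":
        return dft_definition(x)
    _, k, m, left, right = tree
    n = k * m
    # L: gather k blocks; block i holds x[i], x[i+k], x[i+2k], ...
    blocks = [x[i::k] for i in range(k)]
    # I_k kron DFT_m: transform each block independently.
    blocks = [apply_tree(right, b) for b in blocks]
    # T: scale element q of block i by the twiddle factor w_n^(i*q).
    blocks = [[b[q] * cmath.exp(-2j * cmath.pi * i * q / n) for q in range(m)]
              for i, b in enumerate(blocks)]
    # DFT_k kron I_m: transform across blocks for each position q.
    out = [0j] * n
    for q in range(m):
        column = apply_tree(left, [blocks[i][q] for i in range(k)])
        for p in range(k):
            out[p * m + q] = column[p]
    return out

if __name__ == "__main__":
    x = [complex(i, -i) for i in range(16)]
    best = search(16)
    reference = dft_definition(x)
    result = apply_tree(best, x)
    error = max(abs(a - b) for a, b in zip(result, reference))
    print("chosen rule tree:", best)
    print("max deviation from specification:", error)
```

Because the rule is an identity on the underlying linear operator, every tree produced by `expand` computes the same transform as the specification up to floating-point rounding; the search only changes how much work is done. The numerical check at the end is a sanity test for the sketch, not a substitute for a proof of equivalence.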
