Vectorization for digital signal processors via equality saturation

Applications targeting digital signal processors (DSPs) benefit from fast implementations of small linear algebra kernels. While existing auto-vectorizing compilers are effective at extracting performance from large kernels, they struggle to invent the complex data movements necessary to optimize small kernels. To get the best performance, DSP engineers must hand-write and tune specialized small kernels for a wide spectrum of applications and architectures. We present Diospyros, a search-based compiler that automatically finds efficient vectorizations and data layouts for small linear algebra kernels. Diospyros combines symbolic evaluation and equality saturation to vectorize computations with irregular structure. We show that a collection of Diospyros-compiled kernels outperforms implementations from existing DSP libraries by 3.1× on average, that Diospyros can generate kernels that are competitive with expert-tuned code, and that optimizing these small kernels offers an end-to-end speedup for a DSP application.
