Automatically harnessing sparse acceleration

Sparse linear algebra is central to many scientific programs, yet compilers fail to optimize it well. High-performance libraries are available, but adoption costs are significant. Moreover, libraries tie programs into vendor-specific software and hardware ecosystems, creating non-portable code. In this paper, we develop a new approach based on our specification language, the Language for implementers of Linear Algebra Computations (LiLAC). Rather than requiring the application developer to (re)write every program for a given library, the burden is shifted to a one-off description by the library implementer. The LiLAC-enabled compiler uses this description to insert appropriate library routines without source code changes. LiLAC provides automatic data marshaling, maintaining state between calls and minimizing data transfers. Appropriate places for library insertion are detected in the compiler's intermediate representation, independently of the source language. We evaluated LiLAC on large-scale scientific applications written in FORTRAN; standard C/C++ and FORTRAN benchmarks; and C++ graph analytics kernels. Across heterogeneous platforms, applications, and data sets, we show speedups of 1.1× to over 10× without user intervention.
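
As a concrete illustration (the abstract itself contains no code), the canonical computation such a compiler targets is a sparse matrix-vector product over a compressed sparse row (CSR) matrix. The C sketch below, with purely illustrative names not taken from the paper, shows the kind of loop nest a LiLAC-enabled compiler would recognize in its intermediate representation and replace with a tuned library routine, generating the data-marshaling code automatically:

/* Textbook CSR sparse matrix-vector product, y = A*x.
 * A pattern-matching compiler pass can detect this loop nest in IR,
 * swap in a tuned SpMV library routine, and keep device-side copies
 * of the CSR arrays alive between calls to minimize data transfers.
 * All identifiers are hypothetical, not taken from the paper. */
void spmv_csr(int n, const int *row_ptr, const int *col_idx,
              const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += val[k] * x[col_idx[k]];
        y[i] = sum;
    }
}

Because the match is performed on intermediate representation rather than source text, the same FORTRAN or C++ formulation of this kernel would be detected and accelerated identically.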
