Automatic generation of fast BLAS3-GEMM: A portable compiler approach

GEMM is the main computational kernel in BLAS3. Its micro-kernel is either hand-crafted in assembly code or generated from C code by general-purpose compilers (guided by architecture-specific directives or auto-tuning). As a result, either performance or portability suffers. We present a POrtable Compiler Approach, Poca, implemented in LLVM, to automatically generate and optimize this micro-kernel in an architecture-independent manner, without involving domain experts. The key insight is to leverage the wide range of architecture-specific abstractions already available in LLVM: Poca first generates a vectorized micro-kernel in the architecture-independent LLVM IR and then improves its performance by applying a series of domain-specific yet architecture-independent optimizations. The optimized micro-kernel drops easily into existing GEMM frameworks such as BLIS and OpenBLAS. We validate Poca by optimizing double-precision GEMM on two architectures, Intel Sandy Bridge and ARM Cortex-A57 (AArch64), where Poca's micro-kernels outperform expert-crafted assembly code by 2.35% and 7.54%, respectively, and both BLIS and OpenBLAS achieve competitive or better performance once their micro-kernels are replaced by Poca's.