AUGEM: Automatically generate high performance Dense Linear Algebra kernels on x86 CPUs

Basic Linear Algebra Subprograms (BLAS) is a fundamental library in scientific computing. In this paper, we present a template-based optimization framework, AUGEM, which can automatically generate fully optimized assembly code for several dense linear algebra (DLA) kernels, such as GEMM, GEMV, AXPY, and DOT, on a variety of multi-core CPUs without requiring any manual intervention from developers. In particular, based on domain-specific knowledge of the DLA kernel algorithms, we use a collection of parameterized code templates to formulate a number of commonly occurring instruction sequences within the optimized low-level C code of these DLA kernels. Our framework then uses a specialized low-level C optimizer to identify instruction sequences that match the pre-defined code templates and translates them into highly efficient SSE/AVX instructions. The DLA kernels generated by our template-based approach surpass the implementations in the Intel MKL and AMD ACML BLAS libraries on both Intel Sandy Bridge and AMD Piledriver processors.
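To make the template idea concrete, the following is a minimal illustrative sketch in C (not code from the paper; all function and variable names are hypothetical). It contrasts a scalar multiply-add strip of the kind a template might recognize in the low-level C code of a GEMM micro-kernel with a hand-vectorized SSE2 equivalent, approximating the pattern-to-vector mapping the framework performs when it emits SSE/AVX instructions:

    /* Illustrative sketch only: AUGEM matches instruction sequences in
     * low-level C and emits tuned assembly; here we show the scalar
     * pattern and an SSE2 intrinsic rewrite for comparison. */
    #include <emmintrin.h>  /* SSE2 intrinsics */
    #include <stdio.h>

    /* Scalar form of the mul-add sequence a template might recognize:
     * c[i] += a[i] * b, over a 4-element strip. */
    static void muladd_scalar(double *c, const double *a, double b) {
        c[0] += a[0] * b;
        c[1] += a[1] * b;
        c[2] += a[2] * b;
        c[3] += a[3] * b;
    }

    /* The same strip rewritten with SSE2 (2 doubles per register),
     * roughly the vectorized sequence a template would produce. */
    static void muladd_sse2(double *c, const double *a, double b) {
        __m128d vb = _mm_set1_pd(b);          /* broadcast scalar b */
        __m128d c0 = _mm_loadu_pd(c);         /* load c[0..1]       */
        __m128d c1 = _mm_loadu_pd(c + 2);     /* load c[2..3]       */
        c0 = _mm_add_pd(c0, _mm_mul_pd(_mm_loadu_pd(a), vb));
        c1 = _mm_add_pd(c1, _mm_mul_pd(_mm_loadu_pd(a + 2), vb));
        _mm_storeu_pd(c, c0);                 /* store c[0..1] back */
        _mm_storeu_pd(c + 2, c1);             /* store c[2..3] back */
    }

    int main(void) {
        double a[4] = {1, 2, 3, 4};
        double c[4] = {0}, d[4] = {0};
        muladd_scalar(c, a, 2.0);
        muladd_sse2(d, a, 2.0);
        for (int i = 0; i < 4; i++)
            printf("scalar %.1f  sse2 %.1f\n", c[i], d[i]);
        return 0;
    }

AUGEM itself emits optimized assembly rather than intrinsics and also tunes register allocation and instruction scheduling; the intrinsic form above is meant only to show the kind of scalar-to-vector translation the templates encode.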
