Paper Information - Model-Driven SIMD Code Generation for a Multi-resolution Tensor Kernel

Model-Driven SIMD Code Generation for a Multi-resolution Tensor Kernel

In this paper, we describe a model-driven compile-time code generator that transforms a class of tensor contraction expressions into highly optimized short-vector SIMD code. We use as a case study a multi-resolution tensor kernel from the MADNESS quantum chemistry application. Performance of a C-based implementation is low, and because the dimensions of the tensors are small, performance using vendor-optimized BLAS libraries is also suboptimal. We develop a model-driven code generator that determines the optimal loop permutation and the placement of vector load/store, transpose, and splat operations in the generated code, enabling portable performance on short-vector SIMD architectures. Experimental results on an SSE-based platform demonstrate the efficiency of the vector-code synthesizer.
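To make the idea of choosing a loop permutation together with the placement of vector load/store, transpose, and splat operations more concrete, here is a minimal Python sketch. It is not the paper's model: the contraction `C[i,j] += A[i,k] * B[k,j]`, the row-major layout, the rule that the innermost loop is the vectorized one, the helper names `access_kind` and `rank_permutations`, and the relative costs in `COST` are all assumptions chosen for illustration.

```python
from itertools import permutations

# Assumed example contraction: C[i,j] += A[i,k] * B[k,j], all arrays row-major,
# so the LAST index of each operand is the unit-stride one.
OPERANDS = {"C": ("i", "j"), "A": ("i", "k"), "B": ("k", "j")}
OUTPUT = "C"

# Hypothetical relative costs, picked only to make the ranking visible.
COST = {"vector": 1, "splat": 2, "reduce": 3, "transpose": 4}


def access_kind(name, indices, vec_loop):
    """Classify how one operand is accessed when vec_loop is vectorized."""
    if vec_loop not in indices:
        # Same element for every lane: broadcast an input,
        # or sum the lanes horizontally if it is the output.
        return "reduce" if name == OUTPUT else "splat"
    if indices[-1] == vec_loop:
        return "vector"      # unit-stride: plain vector load/store
    return "transpose"       # strided: lanes must be gathered via a transpose


def rank_permutations():
    """Return (cost, loop order, per-operand access kind), cheapest first."""
    results = []
    for order in permutations("ijk"):
        vec_loop = order[-1]  # assumption: the innermost loop is vectorized
        kinds = {
            name: access_kind(name, idx, vec_loop)
            for name, idx in OPERANDS.items()
        }
        cost = sum(COST[kind] for kind in kinds.values())
        results.append((cost, order, kinds))
    return sorted(results, key=lambda r: (r[0], r[1]))


if __name__ == "__main__":
    for cost, order, kinds in rank_permutations():
        print(cost, "".join(order), kinds)
```

Under these assumed costs, the orders with `j` innermost rank first, because `C` and `B` are then accessed with unit-stride vector operations and `A` only needs a splat. A real generator would replace the toy cost table with a model of the target architecture.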
