Automatic generation of fast BLAS3-GEMM: A portable compiler approach
[1] Yang Yang,et al. Automatic Library Generation for BLAS3 on GPUs , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.
[2] Hao Zhou,et al. Exploiting mixed SIMD parallelism by reducing data reorganization overhead , 2016, 2016 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).
[3] Jack J. Dongarra,et al. Automatically Tuned Linear Algebra Software , 1998, Proceedings of the IEEE/ACM SC98 Conference.
[4] Jingling Xue,et al. Loop Tiling for Parallelism , 2000, Kluwer International Series in Engineering and Computer Science.
[5] James Demmel,et al. Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology , 1997, ICS '97.
[6] Zhang Yunquan,et al. Model-driven Level 3 BLAS Performance Optimization on Loongson 3A Processor , 2012, ICPADS.
[7] Guang R. Gao,et al. A Register Allocation Framework Based on Hierarchical Cyclic Interval Graphs , 1992, CC.
[8] Ramesh C. Agarwal,et al. Exploiting functional parallelism of POWER2 to design high-performance numerical algorithms , 1994, IBM J. Res. Dev.
[9] Robert A. van de Geijn,et al. Anatomy of High-Performance Many-Threaded Matrix Multiplication , 2014, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.
[10] Gregory J. Chaitin,et al. Register allocation & spilling via graph coloring , 1982, SIGPLAN '82.
[11] Qing Yi,et al. Layout-oblivious compiler optimization for matrix computations , 2013, TACO.
[12] Gang Ren,et al. Is Search Really Necessary to Generate High-Performance BLAS? , 2005, Proceedings of the IEEE.
[13] Qing Yi,et al. Automated programmable control and parameterization of compiler optimizations , 2011, International Symposium on Code Generation and Optimization (CGO 2011).
[14] Richard Veras,et al. When polyhedral transformations meet SIMD code generation , 2013, PLDI.
[15] Bo Kågström,et al. GEMM-based level 3 BLAS: high-performance model implementations and performance evaluation benchmark , 1998, TOMS.
[16] Josep Llosa,et al. Swing modulo scheduling: a lifetime-sensitive approach , 1996, Proceedings of the 1996 Conference on Parallel Architectures and Compilation Techniques.
[17] Constantinos E. Goutis,et al. A Matrix–Matrix Multiplication methodology for single/multi-core architectures using SIMD , 2014, The Journal of Supercomputing.
[18] Michael D. Smith,et al. A generalized algorithm for graph-coloring register allocation , 2004, PLDI '04.
[19] Saman P. Amarasinghe,et al. Exploiting superword level parallelism with multimedia instruction sets , 2000, PLDI '00.
[20] Peng Wu,et al. Vectorization for SIMD architectures with alignment constraints , 2004, PLDI '04.
[21] Christine Eisenbeis,et al. The meeting graph: a new model for loop cyclic register allocation , 1995, PACT.
[22] Lin Gao,et al. Thread-Sensitive Modulo Scheduling for Multicore Processors , 2008, 2008 37th International Conference on Parallel Processing.
[23] B. Ramakrishna Rau,et al. Iterative modulo scheduling: an algorithm for software pipelining loops , 1994, MICRO 27.
[24] E. R. Jessup,et al. Automatic Generation of Tiled and Parallel Linear Algebra Routines: A partitioning framework for the BTO Compiler , 2010.
[25] Robert A. van de Geijn,et al. BLIS: A Framework for Rapidly Instantiating BLAS Functionality , 2015, ACM Trans. Math. Softw.
[26] Qian Wang,et al. AUGEM: Automatically generate high performance Dense Linear Algebra kernels on x86 CPUs , 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[27] Markus Püschel,et al. A Basic Linear Algebra Compiler , 2014, CGO '14.
[28] Monica S. Lam,et al. Retrospective: Software Pipelining: An Effective Scheduling Technique for VLIW Machines , 1998.
[29] Uday Bondhugula,et al. A practical automatic polyhedral parallelizer and locality optimizer , 2008, PLDI '08.
[30] Wen-mei W. Hwu,et al. Unrolling-based optimizations for modulo scheduling , 1995, MICRO 1995.
[31] Robert A. van de Geijn,et al. Anatomy of high-performance matrix multiplication , 2008, TOMS.
[32] Li Wang,et al. Reuse-aware modulo scheduling for stream processors , 2010, 2010 Design, Automation & Test in Europe Conference & Exhibition (DATE 2010).
[33] Dongrui Fan,et al. Extendable pattern-oriented optimization directives , 2012, International Symposium on Code Generation and Optimization (CGO 2011).
[34] Monica S. Lam,et al. The cache performance and optimizations of blocked algorithms , 1991, ASPLOS IV.
[35] James Demmel,et al. Benchmarking GPUs to tune dense linear algebra , 2008, HiPC 2008.
[36] Hao Zhou,et al. A Compiler Approach for Exploiting Partial SIMD Parallelism , 2016, ACM Trans. Archit. Code Optim.
[37] Tze Meng Low,et al. Analytical Modeling Is Enough for High-Performance BLIS , 2016, ACM Trans. Math. Softw.