SCP
Canqun Yang | Xiangke Liao | Xing Su | Jingling Xue | Hao Jiang
[1] Qing Yi, et al. Automated programmable control and parameterization of compiler optimizations, 2011, International Symposium on Code Generation and Optimization (CGO 2011).
[2] Zhao Zhang, et al. Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems, 2008, 2008 IEEE 14th International Symposium on High Performance Computer Architecture.
[3] Qian Wang, et al. Specializing Compiler Optimizations through Programmable Composition for Dense Matrix Computations, 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.
[4] Jack J. Dongarra, et al. Porting the PLASMA Numerical Library to the OpenMP Standard, 2017, International Journal of Parallel Programming.
[5] Yang Yang, et al. Automatic Library Generation for BLAS3 on GPUs, 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.
[6] Markus Püschel, et al. A Basic Linear Algebra Compiler, 2014, CGO '14.
[7] Constantinos E. Goutis, et al. A Matrix–Matrix Multiplication methodology for single/multi-core architectures using SIMD, 2014, The Journal of Supercomputing.
[8] Saman P. Amarasinghe, et al. Exploiting superword level parallelism with multimedia instruction sets, 2000, PLDI '00.
[9] Peng Wu, et al. Vectorization for SIMD architectures with alignment constraints, 2004, PLDI '04.
[10] Qian Wang, et al. AUGEM: Automatically generate high performance Dense Linear Algebra kernels on x86 CPUs, 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[11] Jack J. Dongarra, et al. Scheduling dense linear algebra operations on multicore processors, 2010, Concurr. Comput. Pract. Exp.
[12] Hao Zhou, et al. Exploiting mixed SIMD parallelism by reducing data reorganization overhead, 2016, 2016 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).
[13] Jack J. Dongarra, et al. Automatically Tuned Linear Algebra Software, 1998, Proceedings of the IEEE/ACM SC98 Conference.
[14] Richard Veras, et al. When polyhedral transformations meet SIMD code generation, 2013, PLDI.
[15] Jingling Xue, et al. Loop Tiling for Parallelism, 2000, Kluwer International Series in Engineering and Computer Science.
[16] Julien Langou, et al. A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures, 2007, Parallel Comput.
[17] Hao Zhou, et al. A Compiler Approach for Exploiting Partial SIMD Parallelism, 2016, ACM Trans. Archit. Code Optim.
[18] James Demmel, et al. Benchmarking GPUs to tune dense linear algebra, 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.
[19] Qingfeng Hu, et al. High-Performance Matrix Multiply on a Massively Multithreaded Fiteng1000 Processor, 2012, ICA3PP.
[20] Charles L. Lawson, et al. Basic Linear Algebra Subprograms for Fortran Usage, 1979, TOMS.
[21] Robert A. van de Geijn, et al. BLIS: A Framework for Rapidly Instantiating BLAS Functionality, 2015, ACM Trans. Math. Softw.
[22] Canqun Yang, et al. Design and Implementation of a Highly Efficient DGEMM for 64-Bit ARMv8 Multi-core Processors, 2015, 2015 44th International Conference on Parallel Processing.
[23] Yale N. Patt, et al. Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches, 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).
[24] Gang Ren, et al. Is Search Really Necessary to Generate High-Performance BLAS?, 2005, Proceedings of the IEEE.
[25] Tze Meng Low, et al. The BLIS Framework, 2016.
[26] Xiangke Liao, et al. Automatic generation of fast BLAS3-GEMM: A portable compiler approach, 2017, 2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).
[27] Robert A. van de Geijn, et al. High-performance implementation of the level-3 BLAS, 2008, TOMS.
[28] Monica S. Lam, et al. The cache performance and optimizations of blocked algorithms, 1991, ASPLOS IV.
[29] Richard W. Vuduc, et al. POET: Parameterized Optimizations for Empirical Tuning, 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.
[30] Bo Kågström, et al. Algorithm 784: GEMM-based level 3 BLAS: portability and optimization issues, 1998, TOMS.
[31] Tze Meng Low, et al. Analytical Modeling Is Enough for High-Performance BLIS, 2016, ACM Trans. Math. Softw.
[32] Yunquan Zhang, et al. Model-driven Level 3 BLAS Performance Optimization on Loongson 3A Processor, 2012, ICPADS.
[33] Jack J. Dongarra, et al. Collecting Performance Data with PAPI-C, 2009, Parallel Tools Workshop.
[34] Bo Kågström, et al. GEMM-based level 3 BLAS: high-performance model implementations and performance evaluation benchmark, 1998, TOMS.
[35] Uday Bondhugula, et al. A practical automatic polyhedral parallelizer and locality optimizer, 2008, PLDI '08.
[36] Robert A. van de Geijn, et al. Anatomy of high-performance matrix multiplication, 2008, TOMS.
[37] Hans Werner Meuer, et al. Top500 Supercomputer Sites, 1997.
[38] Aamer Jaleel, et al. Adaptive insertion policies for managing shared caches, 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).
[39] Robert A. van de Geijn, et al. Anatomy of High-Performance Many-Threaded Matrix Multiplication, 2014, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.