Is Search Really Necessary to Generate High-Performance BLAS?
Gang Ren | David A. Padua | Keshav Pingali | Paul Stodghill | Xiaoming Li | María Jesús Garzarán | Kamen Yotov
[1] Yuefan Deng,et al. New trends in high performance computing , 2001, Parallel Computing.
[2] Wei Li,et al. Unifying data and control transformations for distributed shared-memory machines , 1995, PLDI '95.
[3] Philippe Clauss. Counting Solutions to Linear and Nonlinear Constraints Through Ehrhart Polynomials: Applications to Analyze and Transform Scientific Programs , 1996, International Conference on Supercomputing.
[4] Edward G. Coffman,et al. Organizing matrices and matrix operations for paged memory systems , 1969, Commun. ACM.
[5] David A. Patterson,et al. Computer Architecture: A Quantitative Approach , 1969 .
[6] Kathryn S. McKinley,et al. Tile size selection using cache organization and data layout , 1995, PLDI '95.
[7] Steven G. Johnson,et al. FFTW: an adaptive software architecture for the FFT , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).
[8] Martin Fowler. Yet Another Optimization Article , 2002, IEEE Softw..
[9] Alexandru Nicolau,et al. Advances in languages and compilers for parallel processing , 1991 .
[10] James Demmel,et al. Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology , 1997, ICS '97.
[11] Paul Feautrier. Some efficient solutions to the affine scheduling problem , 1992 .
[12] Jack J. Dongarra,et al. Automated empirical optimizations of software and the ATLAS project , 2001, Parallel Comput..
[13] Keshav Pingali,et al. Access normalization: loop restructuring for NUMA compilers , 1992, ASPLOS V.
[14] Wei Li,et al. Unifying data and control transformations for distributed shared-memory machines , 1995 .
[15] Ken Kennedy,et al. Estimating Interlock and Improving Balance for Pipelined Architectures , 1988, J. Parallel Distributed Comput..
[16] Shirley Dex,et al. On the operation and management of the JR integrated passenger ticket sales system (MARS) , 1991 .
[17] Greg M. Henry,et al. Flexible High-Performance Matrix Multiply via a Self-Modifying Runtime Code , 2001 .
[18] Siddhartha Chatterjee,et al. Exact analysis of the cache behavior of nested loops , 2001, PLDI '01.
[19] Steven G. Johnson,et al. The Design and Implementation of FFTW3 , 2005, Proceedings of the IEEE.
[20] David A. Padua,et al. Advanced compiler optimizations for supercomputers , 1986, CACM.
[21] Emilio L. Zapata,et al. Automatic analytical modeling for the estimation of cache misses , 1999, 1999 International Conference on Parallel Architectures and Compilation Techniques (Cat. No.PR00425).
[22] Ramesh C. Agarwal,et al. Engineering and Scientific Subroutine Library Release 3 for IBM ES/3090 Vector Multiprocessors , 1989, IBM Syst. J..
[23] Ken Kennedy,et al. Improving register allocation for subscripted variables , 1990, PLDI '90.
[24] Paul Feautrier,et al. Some efficient solutions to the affine scheduling problem. I. One-dimensional time , 1992, International Journal of Parallel Programming.
[25] Adolfy Hoisie,et al. Performance Optimization of Numerically Intensive Codes , 1987 .
[26] David A. Padua,et al. Searching for the Best FFT Formulas with the SPL Compiler , 2000, LCPC.
[27] Keshav Pingali,et al. Data-centric multi-level blocking , 1997, PLDI '97.
[28] Ramesh C. Agarwal,et al. Improving performance of linear algebra algorithms for dense matrices, using algorithmic prefetch , 1994, IBM J. Res. Dev..
[29] Philippe Clauss,et al. Counting solutions to linear and nonlinear constraints through Ehrhart polynomials: applications to analyze and transform scientific programs , 1996 .
[30] Allen,et al. Optimizing Compilers for Modern Architectures , 2004 .
[31] Michael Wolfe,et al. Iteration Space Tiling for Memory Hierarchies , 1987, PPSC.
[32] R. C. Whaley,et al. Minimizing development and maintenance costs in supporting persistently optimized BLAS , 2005, Softw. Pract. Exp..
[33] Jack Dongarra,et al. Automatic Blocking of Nested Loops , 1990 .
[34] Keshav Pingali,et al. Access normalization: loop restructuring for NUMA computers , 1993, TOCS.
[35] Yves Robert,et al. (Pen)-ultimate tiling? , 1994, Integr..
[36] Irving L. Traiger,et al. Evaluation Techniques for Storage Hierarchies , 1970, IBM Syst. J..
[37] Franz Franchetti,et al. SPIRAL: Code Generation for DSP Transforms , 2005, Proceedings of the IEEE.