Fractal Matrix Multiplication: A Case Study on Portability of Cache Performance
[1] Gaston H. Gonnet,et al. The analysis of multidimensional searching in quad-trees , 1991, SODA '91.
[2] W. Jalby,et al. To copy or not to copy: a compile-time technique for assessing when data copying should be used to eliminate cache conflicts , 1993, Supercomputing '93.
[3] Alok Aggarwal,et al. Hierarchical memory with block transfer , 1987, 28th Annual Symposium on Foundations of Computer Science (sfcs 1987).
[4] Don Coppersmith,et al. Matrix multiplication via arithmetic progressions , 1987, STOC.
[5] Steven G. Johnson,et al. The Fastest Fourier Transform in the West , 1997 .
[6] Monica S. Lam,et al. The cache performance and optimizations of blocked algorithms , 1991, ASPLOS IV.
[7] Bo Kågström,et al. Algorithm 784: GEMM-based level 3 BLAS: portability and optimization issues , 1998, TOMS.
[8] Michael Wolfe,et al. More iteration space tiling , 1989, Proceedings of the 1989 ACM/IEEE Conference on Supercomputing (Supercomputing '89).
[9] Matteo Frigo,et al. Cache-oblivious algorithms , 1999, 40th Annual Symposium on Foundations of Computer Science (Cat. No.99CB37039).
[10] Nicholas J. Higham,et al. INVERSE PROBLEMS NEWSLETTER , 1991 .
[11] F. P. Preparata,et al. Processor-Time Tradeoffs under Bounded-Speed Message Propagation: Part I, Upper Bounds , 1995, Theory of Computing Systems.
[12] Mithuna Thottethodi,et al. Nonlinear array layouts for hierarchical memory systems , 1999, ICS '99.
[13] Michael Rodeh,et al. Matrix Multiplication: A Case Study of Algorithm Engineering , 1998, WAE.
[14] Monica S. Lam,et al. A data locality optimizing algorithm , 1991, PLDI '91.
[15] Andrea Pietracaprina,et al. On the Space and Access Complexity of Computation DAGs , 2000, WG.
[16] Gianfranco Bilardi,et al. An approach towards an analytical characterization of locality and its portability , 2001, 2001 Innovative Architecture for Future Generation High-Performance Processors and Systems.
[17] Fred G. Gustavson,et al. Recursion leads to automatic variable blocking for dense linear-algebra algorithms , 1997, IBM J. Res. Dev..
[18] Jeremy D. Frens,et al. Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code , 1997, PPOPP '97.
[19] Hiroshi Nakamura,et al. Improving Cache Performance Through Tiling and Data Alignment , 1997, IRREGULAR.
[20] Michael Wolfe,et al. High performance compilers for parallel computing , 1995 .
[21] Mithuna Thottethodi,et al. Tuning Strassen's Matrix Multiplication for Memory Efficiency , 1998, Proceedings of the IEEE/ACM SC98 Conference.
[22] Josef Stoer,et al. Numerische Mathematik 1 , 1989 .
[23] Jack J. Dongarra,et al. Automatically Tuned Linear Algebra Software , 1998, Proceedings of the IEEE/ACM SC98 Conference.
[24] H. T. Kung,et al. I/O complexity: The red-blue pebble game , 1981, STOC '81.
[25] Steven S. Muchnick,et al. Advanced Compiler Design and Implementation , 1997 .
[26] Bo Kågström,et al. GEMM-based level 3 BLAS: high-performance model implementations and performance evaluation benchmark , 1998, TOMS.
[27] Isak Jonsson,et al. Recursive Blocked Data Formats and BLAS's for Dense Linear Algebra Algorithms , 1998, PARA.
[28] Bowen Alpern,et al. A model for hierarchical memory , 1987, STOC.
[29] Mithuna Thottethodi,et al. Recursive array layouts and fast parallel matrix multiplication , 1999, SPAA '99.
[30] Gianfranco Bilardi,et al. A Characterization of Temporal Locality and Its Portability across Memory Hierarchies , 2001, ICALP.
[31] David S. Wise. Undulant-Block Elimination and Integer-Preserving Matrix Inversion , 1999, Sci. Comput. Program..
[32] Ken Kennedy,et al. Compiler blockability of numerical algorithms , 1992, Proceedings Supercomputing '92.
[33] Sivan Toledo. Locality of Reference in LU Decomposition with Partial Pivoting , 1997, SIAM J. Matrix Anal. Appl..
[34] Michel J. Daydé,et al. The RISC BLAS: a blocked implementation of level 3 BLAS for RISC processors , 1999, TOMS.
[35] Rudolf Eigenmann,et al. Automatic program parallelization , 1993, Proc. IEEE.
[36] P. D'Alberto. Performance Evaluation of Data Locality Exploitation (Ph.D. Thesis) , 2000 .
[37] V. Strassen. Gaussian elimination is not optimal , 1969 .
[38] Dhabaleswar K. Panda,et al. Communication and Architectural Support for Network-Based Parallel Computing , 1997, Lecture Notes in Computer Science.