Optimizing matrix multiplication for a short-vector SIMD architecture - CELL processor
[1] V. Strassen. Gaussian elimination is not optimal , 1969 .
[2] Samuel Williams,et al. The Landscape of Parallel Computing Research: A View from Berkeley , 2006 .
[3] gazon synthétique,et al. Operations , 1961 .
[4] David A. Patterson,et al. Computer Architecture, Fifth Edition: A Quantitative Approach , 2011 .
[5] David A. Patterson,et al. Computer Architecture: A Quantitative Approach , 1969 .
[6] Samuel Williams,et al. The potential of the cell processor for scientific computing , 2005, CF '06.
[7] Ed Anderson,et al. LAPACK Users' Guide , 1995 .
[8] Jack Dongarra,et al. High-Performance Computing , 1998 .
[9] D. Geer,et al. Chip makers turn to multicore processors , 2005, Computer.
[10] B. Flachs,et al. A streaming processing unit for a CELL processor , 2005, ISSCC. 2005 IEEE International Digest of Technical Papers. Solid-State Circuits Conference, 2005.
[11] W. Paul,et al. Computer Architecture , 2000, Springer Berlin Heidelberg.
[12] Jack J. Dongarra,et al. Solving Systems of Linear Equations on the CELL Processor Using Cholesky Factorization , 2008, IEEE Transactions on Parallel and Distributed Systems.
[13] Jason N. Dale,et al. Cell Broadband Engine Architecture and its first implementation - A performance view , 2007, IBM J. Res. Dev..
[14] Don Coppersmith,et al. Matrix multiplication via arithmetic progressions , 1987, STOC.
[15] Martin Hopkins,et al. Synergistic Processing in Cell's Multicore Architecture , 2006, IEEE Micro.
[16] T. Chen,et al. Cell Broadband Engine Architecture and its first implementation - A view , 2007 .
[17] Viktor K. Prasanna,et al. Tiling, Block Data Layout, and Memory Hierarchy Performance , 2003, IEEE Trans. Parallel Distributed Syst..
[18] Shekhar Y. Borkar,et al. Design challenges of technology scaling , 1999, IEEE Micro.
[19] Jack Dongarra,et al. A class of parallel tiled linear algebra algorithms for multicore architectures , 2009 .
[20] Jack Dongarra,et al. Solving Systems of Linear Equations on the CELL Processor Using Cholesky Factorization , 2008 .
[21] Steven S. Muchnick,et al. Advanced Compiler Design and Implementation , 1997 .
[22] Bo Kågström,et al. GEMM-based level 3 BLAS: high-performance model implementations and performance evaluation benchmark , 1998, TOMS.
[23] Jack J. Dongarra,et al. Implementation of mixed precision in solving systems of linear equations on the Cell processor , 2007, Concurr. Comput. Pract. Exp..
[24] Jack Dongarra,et al. Parallel tiled QR factorization for multicore architectures , 2008 .
[25] Jaewook Shin,et al. Exploiting Superword-Level Locality in Multimedia Extension Architectures , 2003, J. Instr. Level Parallelism.
[26] James Demmel,et al. Applied Numerical Linear Algebra , 1997 .
[27] Jack Dongarra,et al. ScaLAPACK Users' Guide , 1987 .
[28] B. Flachs,et al. The microarchitecture of the synergistic processor for a cell processor , 2006, IEEE Journal of Solid-State Circuits.
[29] Viktor K. Prasanna,et al. Analysis of memory hierarchy performance of block data layout , 2002, Proceedings International Conference on Parallel Processing.
[30] Samuel Williams,et al. Scientific Computing Kernels on the Cell Processor , 2007, International Journal of Parallel Programming.
[31] Juan J. Navarro,et al. Using Non-canonical Array Layouts in Dense Matrix Operations , 2006, PARA.
[32] Jack J. Dongarra,et al. The LINPACK Benchmark: past, present and future , 2003, Concurr. Comput. Pract. Exp..
[33] Jack Dongarra,et al. Numerical Linear Algebra for High-Performance Computers , 1998 .
[34] Robert A. van de Geijn,et al. Supermatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures , 2007, SPAA '07.
[35] Douglas Aberdeen,et al. Emmerald: a fast matrix–matrix multiply using Intel's SSE instructions , 2001, Concurr. Comput. Pract. Exp..