A Matrix–Matrix Multiplication methodology for single/multi-core architectures using SIMD
Constantinos E. Goutis | Vasilios I. Kelefouras | Angeliki Kritikakou
[1] Robert A. van de Geijn,et al. SUMMA: Scalable Universal Matrix Multiplication Algorithm , 1995.
[2] Jarek Nieplocha,et al. Memory efficient parallel matrix multiplication operation for irregular problems , 2006, CF '06.
[3] Bradley C. Kuszmaul,et al. Cilk: an efficient multithreaded runtime system , 1995, PPOPP '95.
[4] Martin Fleury,et al. Matrix Multiplication Performance on Commodity Shared-Memory Multiprocessors , 2004.
[5] Chi-Bang Kuan,et al. Automated Empirical Optimization , 2011, Encyclopedia of Parallel Computing.
[6] Dimitrios S. Nikolopoulos. Code and Data Transformations for Improving Shared Cache Performance on SMT Processors , 2003, ISHPC.
[7] Jack J. Dongarra,et al. Automatically Tuned Linear Algebra Software , 1998, Proceedings of the IEEE/ACM SC98 Conference.
[8] François Bodin,et al. A Machine Learning Approach to Automatic Production of Compiler Heuristics , 2002, AIMSA.
[9] Antoine Petitet,et al. Minimizing development and maintenance costs in supporting persistently optimized BLAS , 2005.
[10] Saman P. Amarasinghe,et al. Meta optimization: improving compiler heuristics with machine learning , 2003, PLDI '03.
[11] Mithuna Thottethodi,et al. Tuning Strassen's Matrix Multiplication for Memory Efficiency , 1998, Proceedings of the IEEE/ACM SC98 Conference.
[12] Yuefan Deng,et al. New trends in high performance computing , 2001, Parallel Computing.
[13] Frédéric Suter,et al. Mixed Parallel Implementations of Strassen and Winograd Matrix Multiplication Algorithms , 2001.
[14] Pierre Michaud. Replacement policies for shared caches on symmetric multicores: a programmer-centric point of view , 2011, HiPEAC.
[15] David F. Bacon,et al. Compiler transformations for high-performance computing , 1994, CSUR.
[16] Thomas Rauber,et al. Automatic Tuning of PDGEMM Towards Optimal Performance , 2005, Euro-Par.
[17] James Demmel,et al. Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology , 1997, ICS '97.
[18] Thomas Rauber,et al. Multilevel hierarchical matrix multiplication on clusters , 2004, ICS '04.
[19] Nathan A. Carr,et al. Cache and bandwidth aware matrix multiplication on the GPU , 2010.
[20] Steven G. Johnson,et al. The Fastest Fourier Transform in the West , 1997.
[21] Shlomit S. Pinter,et al. Register allocation with instruction scheduling: a new approach , 1996, Journal of Programming Languages.
[22] Ghassan Shobaki,et al. Preallocation instruction scheduling with register pressure minimization using a combinatorial optimization approach , 2013, ACM Trans. Archit. Code Optim.
[23] E. Granston,et al. Automatic Recommendation of Compiler Options , 2001.
[24] Michael Schwind,et al. Fast recursive matrix multiplication for multi-core architectures , 2010, ICCS.
[25] Jarek Nieplocha,et al. SRUMMA: a matrix multiplication algorithm suitable for clusters and scalable shared memory systems , 2004, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings.
[26] Christos Faloutsos,et al. Analysis of the Clustering Properties of the Hilbert Space-Filling Curve , 2001, IEEE Trans. Knowl. Data Eng.
[27] Stefano Crespi-Reghizzi,et al. Continuous learning of compiler heuristics , 2013, TACO.
[28] Dong Zhou,et al. Translation techniques in cross-language information retrieval , 2012, CSUR.
[29] Michael F. P. O'Boyle,et al. A Feasibility Study in Iterative Compilation , 1999, ISHPC.
[30] Jack Dongarra,et al. ScaLAPACK user's guide , 1997.
[31] Keshav Pingali,et al. An Experimental Study of Self-Optimizing Dense Linear Algebra Software , 2008, Proceedings of the IEEE.
[32] Marc Snir,et al. Automatic tuning matrix multiplication performance on graphics hardware , 2005, 14th International Conference on Parallel Architectures and Compilation Techniques (PACT'05).
[33] Mithuna Thottethodi,et al. Recursive array layouts and fast parallel matrix multiplication , 1999, SPAA '99.
[34] Guang R. Gao,et al. Optimized Dense Matrix Multiplication on a Many-Core Architecture , 2010, Euro-Par.
[35] Frédéric Suter,et al. Impact of mixed-parallelism on parallel implementations of the Strassen and Winograd matrix multiplication algorithms: Research Articles , 2004.
[36] Michael F. P. O'Boyle,et al. Using machine learning to focus iterative optimization , 2006, International Symposium on Code Generation and Optimization (CGO'06).
[37] Wei Chen,et al. Parallel matrix-multiplication algorithm for distributed parallel computers , 2000, Systems and Computers in Japan.
[38] Gang Ren,et al. Is Search Really Necessary to Generate High-Performance BLAS? , 2005, Proceedings of the IEEE.
[39] Keith D. Cooper,et al. Adaptive Optimizing Compilers for the 21st Century , 2002, The Journal of Supercomputing.
[40] Franz Franchetti,et al. Computer Generation of Hardware for Linear Digital Signal Processing Transforms , 2012, TODE.
[41] Petter E. Bjørstad,et al. Efficient Matrix Multiplication on SIMD Computers , 1992, SIAM J. Matrix Anal. Appl.
[42] Dongrui Fan,et al. High Performance Matrix Multiplication on Many Cores , 2009, Euro-Par.
[43] Alexander Krivutsenko. GotoBLAS - Anatomy of a fast matrix multiplication High performance libraries in computational science , 2008.
[44] Jaeyoung Choi. A new parallel matrix multiplication algorithm on distributed-memory concurrent computers , 1998, Concurr. Pract. Exp.
[45] V. Strassen. Gaussian elimination is not optimal , 1969.
[46] Jack J. Dongarra,et al. Optimizing matrix multiplication for a short-vector SIMD architecture - CELL processor , 2009, Parallel Comput.
[47] Manuel Prieto,et al. Survey of scheduling techniques for addressing shared resources in multicore processors , 2012, CSUR.
[48] Gary S. Tyson,et al. Practical exhaustive optimization phase order exploration and evaluation , 2009, TACO.
[49] Douglas L. Jones,et al. Fast searches for effective optimization phase sequences , 2004, PLDI '04.
[50] Nicholas Nethercote,et al. Valgrind: a framework for heavyweight dynamic binary instrumentation , 2007, PLDI '07.
[51] Martin Fleury,et al. Matrix Multiplication Performance on Commodity Shared-Memory Multiprocessors , 2004, Parallel Computing in Electrical Engineering, 2004. International Conference on.
[52] Robert A. van de Geijn,et al. SUMMA: scalable universal matrix multiplication algorithm , 1995, Concurr. Pract. Exp.
[53] Sameer Kulkarni,et al. An evaluation of different modeling techniques for iterative compilation , 2011, 2011 Proceedings of the 14th International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES).
[54] David I. August,et al. Compiler optimization-space exploration , 2003, International Symposium on Code Generation and Optimization, 2003. CGO 2003.
[55] Robert A. van de Geijn,et al. Anatomy of high-performance matrix multiplication , 2008, TOMS.