A Matrix–Matrix Multiplication methodology for single/multi-core architectures using SIMD

This paper presents a new methodology for speeding up Matrix–Matrix Multiplication using the Single Instruction Multiple Data (SIMD) unit, on single-core and multi-core architectures with a shared cache. The methodology achieves higher execution speed than the state-of-the-art ATLAS library (speedups from 1.08 to 3.5) by decreasing the number of instructions (load/store and arithmetic) and the number of data cache accesses and misses across the memory hierarchy. It does so by exploiting software characteristics (e.g. data reuse) and hardware parameters (e.g. data cache sizes and associativities) together, as one problem rather than separately, yielding high-quality solutions and a smaller search space.
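The core idea of trading cache misses for data reuse can be illustrated with loop tiling. The sketch below is not the paper's method: the tile size here is a fixed illustrative constant, whereas the methodology described above derives tile sizes from the actual cache sizes and associativities. The inner j-loop walks contiguous memory, which is the access pattern a compiler (or hand-written SIMD intrinsics) can map onto vector lanes.

```c
#include <string.h>

#define N 64
#define TILE 16  /* illustrative; the paper derives tiles from cache parameters */

/* Reference: naive triple loop. */
static void mmm_naive(const float A[N][N], const float B[N][N], float C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            float s = 0.0f;
            for (int k = 0; k < N; k++)
                s += A[i][k] * B[k][j];
            C[i][j] = s;
        }
}

/* Cache-blocked (tiled) loop ordering: each TILE x TILE block of A, B, and C
   is reused many times while it is resident in cache, reducing cache misses.
   The contiguous inner j-loop is SIMD-friendly. */
static void mmm_tiled(const float A[N][N], const float B[N][N], float C[N][N]) {
    memset(C, 0, sizeof(float) * N * N);
    for (int ii = 0; ii < N; ii += TILE)
        for (int kk = 0; kk < N; kk += TILE)
            for (int jj = 0; jj < N; jj += TILE)
                for (int i = ii; i < ii + TILE; i++)
                    for (int k = kk; k < kk + TILE; k++) {
                        float a = A[i][k];       /* scalar reused across the row */
                        for (int j = jj; j < jj + TILE; j++)
                            C[i][j] += a * B[k][j];
                    }
}
```

Both routines compute the same product; the tiled version only changes the order of memory accesses, which is exactly the degree of freedom the methodology searches over when matching the loop structure to the cache hierarchy.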
