A methodology for speeding up matrix vector multiplication for single/multi-core architectures

This paper presents a new methodology for computing dense matrix-vector multiplication on both embedded processors (without a SIMD unit) and general-purpose processors (single-core and multi-core, with a SIMD unit). The methodology achieves higher execution speed than the state-of-the-art ATLAS library (speedups from 1.2 to 1.45). It does so by fully exploiting the combination of software parameters (e.g., data reuse) and hardware parameters (e.g., data cache associativity), which are treated together as one problem rather than separately, yielding a smaller search space and high-quality solutions. The methodology produces a different schedule for different values of (i) the number of data cache levels; (ii) the data cache sizes; (iii) the data cache associativities; (iv) the data cache and main memory latencies; (v) the data array layout of the matrix; and (vi) the number of cores.
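
To make the idea of a cache-dependent schedule concrete, the following is a minimal sketch, not the schedule generator of the paper. It assumes a row-major matrix stored as a list of lists and shows one generic way a hardware parameter (a cache size) can set a software parameter (a column tile width) so that a slice of the vector is reused across all rows. The names `pick_col_tile`, `mvm_tiled`, `mvm_reference`, and the `fraction` heuristic are hypothetical and chosen only for illustration.

```python
def mvm_reference(A, x):
    """Plain row-by-row y = A * x for a row-major list-of-lists matrix."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def pick_col_tile(cache_bytes, elem_bytes=4, fraction=0.5):
    """Hypothetical heuristic (an assumption, not from the paper):
    let the reused slice of x occupy a fixed fraction of one cache level."""
    return max(1, int(cache_bytes * fraction) // elem_bytes)


def mvm_tiled(A, x, col_tile):
    """Column-tiled y = A * x.

    The outer loop walks over column tiles; for each tile, the same slice
    of x is reused by every row before moving on, which is the data-reuse
    pattern that tiling is meant to expose.
    """
    n_rows = len(A)
    n_cols = len(x)
    y = [0.0] * n_rows
    for j0 in range(0, n_cols, col_tile):
        j1 = min(j0 + col_tile, n_cols)
        x_tile = x[j0:j1]              # slice of x reused by all rows
        for i in range(n_rows):
            row = A[i]
            acc = 0.0                  # partial dot product for this tile
            for j in range(j0, j1):
                acc += row[j] * x_tile[j - j0]
            y[i] += acc
    return y


if __name__ == "__main__":
    A = [[1, 2, 3], [4, 5, 6]]
    x = [1, 0, -1]
    tile = pick_col_tile(cache_bytes=32 * 1024)  # assumed 32 KiB L1 data cache
    print(mvm_tiled(A, x, col_tile=min(tile, len(x))))
```

In this sketch, a different cache size yields a different tile width and therefore a different loop schedule for the same computation. A real implementation would also have to account for the other parameters listed in the abstract, such as associativity, latencies, array layout, and core count.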

[22]  Nicholas Nethercote,et al.  Valgrind: a framework for heavyweight dynamic binary instrumentation , 2007, PLDI '07.

[23]  Hamid R. Arabnia,et al.  A distributed stereocorrelation algorithm , 1995, Proceedings of Fourth International Conference on Computer Communications and Networks - IC3N'95.

[24]  Noriyuki Fujimoto Dense Matrix-Vector Multiplication on the CUDA Architecture , 2008, Parallel Process. Lett..

[25]  Hamid R. Arabnia,et al.  A Transputer Network for the Arbitrary Rotation of Digitised Images , 1987, Comput. J..

[26]  David F. Bacon,et al.  Compiler transformations for high-performance computing , 1994, CSUR.

[27]  Yuefan Deng,et al.  New trends in high performance computing , 2001, Parallel Computing.

[28]  Hamid R. Arabnia,et al.  Arbitrary Rotation of Raster Images with SIMD Machine Architectures , 1987, Comput. Graph. Forum.

[29]  Michael F. P. O'Boyle,et al.  Using machine learning to focus iterative optimization , 2006, International Symposium on Code Generation and Optimization (CGO'06).

[30]  N. Fujimoto,et al.  Faster matrix-vector multiplication on GeForce 8800GTX , 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.

[31]  Martin Schulz,et al.  On the Performance of an Algebraic Multigrid Solver on Multicore Clusters , 2010, VECPAR.

[32]  Alexander Krivutsenko GotoBLAS - Anatomy of a fast matrix multiplication High performance libraries in computational science , 2008 .

[33]  E. Granston,et al.  Automatic Recommendation of Compiler Options , 2001 .

[34]  Todd M. Austin,et al.  The SimpleScalar tool set, version 2.0 , 1997, CARN.

[35]  Michael F. P. O'Boyle,et al.  A Feasibility Study in Iterative Compilation , 1999, ISHPC.

[36]  Hans Henrik Brandenborg Sørensen,et al.  High-Performance Matrix-Vector Multiplication on the GPU , 2011, Euro-Par Workshops.

[37]  Nectarios Koziris,et al.  Performance evaluation of the sparse matrix-vector multiplication on modern architectures , 2009, The Journal of Supercomputing.

[38]  R. C. Whaley,et al.  Minimizing development and maintenance costs in supporting persistently optimized BLAS , 2005, Softw. Pract. Exp..

[39]  S.M. Bhandarkar,et al.  The Hough Transform on a Reconfigurable Multi-Ring Network , 1995, J. Parallel Distributed Comput..

[40]  Steven J. Plimpton,et al.  An Efficient Parallel Algorithm for Matrix-Vector Multiplication , 1995, Int. J. High Speed Comput..

[41]  Hamid R. Arabnia,et al.  A Reconfigurable Architecture for Image Processing and Computer Vision , 1995, Int. J. Pattern Recognit. Artif. Intell..

[42]  Antoine Petitet,et al.  Minimizing development and maintenance costs in supporting persistently optimized BLAS , 2005 .

[43]  Jack J. Dongarra,et al.  Automated empirical optimizations of software and the ATLAS project , 2001, Parallel Comput..

[44]  Vivek N. Waghmare,et al.  Performance Analysis of Matrix-Vector Multiplication in Hybrid (MPI + OpenMP) , 2011 .

[45]  Ghassan Shobaki,et al.  Preallocation instruction scheduling with register pressure minimization using a combinatorial optimization approach , 2013, ACM Trans. Archit. Code Optim..

[46]  Jack J. Dongarra,et al.  Automatically Tuned Linear Algebra Software , 1998, Proceedings of the IEEE/ACM SC98 Conference.