Time and energy modeling of high-performance Level-3 BLAS on x86 architectures

Abstract: We present accurate piece-wise models for the time and energy costs of high-performance implementations of the matrix multiplication (gemm) and the triangular system solve with multiple right-hand sides (trsm) on x86 architectures. Our methodology decouples the costs of the floating-point arithmetic and data movement occurring in the higher levels of the cache hierarchy from those of packing and data transfers between main memory and the L2/L3 caches. A careful analytical study of the data transfers, combined with an architecture-specific calibration of the per-operation costs, then yields the components needed to assemble piece-wise models that accurately estimate the performance of gemm and trsm on x86 processors. Our experimental results on an Intel Xeon E5-2620 processor confirm the accuracy of this approach, which reports average relative errors of around 1.5% for gemm and 4.5% for trsm across different operand shapes, for both time and energy.
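
As a rough illustration of the methodology described above, the following Python sketch assembles a piece-wise time/energy estimate for gemm from two decoupled terms: the floating-point arithmetic performed in the higher cache levels and the packing traffic between main memory and the L2/L3 buffers. All blocking sizes and per-operation cost constants below are hypothetical placeholders; in the paper they are calibrated for the specific x86 processor.

```python
import math

# Minimal sketch of a piece-wise time/energy model for a blocked gemm,
# assuming hypothetical calibrated per-flop and per-packed-element costs
# (on a real machine these would be measured, e.g. energy via RAPL counters).
def gemm_cost(m, n, k, nc=4096,
              t_flop=0.25e-9,    # time per flop in the micro-kernel (s), assumed
              t_pack_a=1.0e-9,   # time per element of A packed into the L2 buffer (s), assumed
              t_pack_b=1.5e-9,   # time per element of B packed into the L3 buffer (s), assumed
              e_flop=0.5e-9,     # energy per flop (J), assumed
              e_pack_a=2.0e-9,   # energy per packed element of A (J), assumed
              e_pack_b=3.0e-9):  # energy per packed element of B (J), assumed
    """Estimate time (s) and energy (J) for C += A * B with A m-by-k, B k-by-n."""
    flops = 2.0 * m * n * k

    # Packing traffic under the usual GotoBLAS/BLIS loop ordering:
    # each element of B is packed once, while each element of A is
    # re-packed once per nc-wide panel of C.
    pack_b = k * n
    pack_a = m * k * math.ceil(n / nc)

    time   = flops * t_flop + pack_a * t_pack_a + pack_b * t_pack_b
    energy = flops * e_flop + pack_a * e_pack_a + pack_b * e_pack_b
    return time, energy

if __name__ == "__main__":
    t, e = gemm_cost(4096, 4096, 4096)
    print(f"estimated time: {t:.2f} s, estimated energy: {e:.1f} J")
```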
