Paper information - Time and energy modeling of a high-performance multi-threaded Cholesky factorization

We present accurate piecewise time and energy models of high-performance multi-threaded implementations of general matrix multiplication, triangular system solve with multiple right-hand sides, and symmetric rank-k update. These models are then assembled into accurate models of the Cholesky factorization built on top of these Level-3 BLAS operations. Our models account for the time and energy costs of the floating-point operations in the routines as well as the overhead of data movement across the levels of the memory hierarchy. The accuracy of the multi-threaded models is tested on an Intel Xeon E5-2620 processor; the relative errors for the Cholesky factorization average about 2.4% for time and 2.9% for energy.
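To make the idea of assembling a factorization model from kernel models concrete, here is a minimal sketch. It is not the paper's model: it assumes a right-looking blocked Cholesky variant (the abstract does not say which variant is used), counts only approximate flops per kernel, and ignores the memory-hierarchy terms that the paper's piecewise models include. The function names, the kernel labels, and the per-flop costs are all hypothetical.

```python
def cholesky_kernel_flops(n, b):
    """Approximate flops per kernel for a right-looking blocked Cholesky
    of an n x n matrix with block size b (assumed variant, not the paper's)."""
    flops = {"potf2": 0.0, "trsm": 0.0, "syrk": 0.0}
    for k in range(0, n, b):
        kb = min(b, n - k)   # size of the current diagonal block
        m = n - k - kb       # rows remaining below the diagonal block
        flops["potf2"] += kb ** 3 / 3.0   # factor the kb x kb diagonal block
        flops["trsm"] += m * kb ** 2      # triangular solve for the m x kb panel
        flops["syrk"] += m * m * kb       # rank-kb update of the trailing m x m block
    return flops


def assemble_cost(flops, cost_per_flop):
    """Combine per-kernel flop counts with per-kernel unit costs
    (seconds per flop for time, joules per flop for energy)."""
    return sum(flops[name] * cost_per_flop[name] for name in flops)


flops = cholesky_kernel_flops(4096, 256)

# Hypothetical unit costs, chosen only for illustration; real values
# would have to be calibrated on the target machine.
time_per_flop = {"potf2": 2.0e-10, "trsm": 1.2e-10, "syrk": 1.0e-10}
estimated_seconds = assemble_cost(flops, time_per_flop)
```

The same `assemble_cost` helper could be reused with joules-per-flop values to sketch an energy estimate. With these approximate counts the per-kernel totals add up to n^3/3, the usual leading-order flop count for a Cholesky factorization, which serves as a quick sanity check on the loop.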
