Time and energy modeling of a high-performance multi-threaded Cholesky factorization

We present accurate piecewise time and energy models of high-performance multi-threaded implementations of the general matrix multiplication, the triangular system solve with multiple right-hand sides, and the symmetric rank-k update. These models are then assembled to provide accurate models of a Cholesky factorization built on top of these Level-3 BLAS operations. Our models consider the costs, in terms of time and energy, of the floating-point operations involved in the routines as well as the overhead due to data movements across the levels of the memory hierarchy. The accuracy of the multi-threaded models is tested on an Intel Xeon E5-2620 processor, where the relative errors for the Cholesky factorization average around 2.4% for time and 2.9% for energy.
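The composition described above can be illustrated with a minimal sketch. It is not the paper's model: it assumes a left-looking blocked Cholesky variant (the abstract does not say which variant is used), a single linear cost per flop and per matrix element moved instead of piecewise models over the memory hierarchy, and rough operand counts in which each operand is touched once. All names (`KernelCost`, `cholesky_kernel_calls`, `model_cost`) and all cost values are hypothetical placeholders that would have to be replaced by calibrated figures.

```python
from dataclasses import dataclass


@dataclass
class KernelCost:
    """Hypothetical per-unit costs of one kernel (seconds or joules)."""
    per_flop: float  # cost of one floating-point operation
    per_word: float  # cost of moving one matrix element


def gemm_counts(m, n, k):
    # C (m x n) -= A (m x k) * B (k x n): 2mnk flops; C is read and written.
    return 2.0 * m * n * k, m * k + k * n + 2 * m * n


def trsm_counts(m, n):
    # B (m x n) := B * inv(L)^T with L (n x n) lower triangular: m*n^2 flops.
    return float(m * n * n), n * (n + 1) // 2 + 2 * m * n


def syrk_counts(n, k):
    # Lower triangle of C (n x n) -= A (n x k) * A^T: n(n+1)k flops.
    return float(n * (n + 1) * k), n * k + n * (n + 1)


def cholesky_kernel_calls(n, b):
    """Yield (kernel, flops, words) for each call made by a left-looking
    blocked Cholesky of an n x n matrix with block size b (assumed variant)."""
    for k in range(0, n, b):
        nb = min(b, n - k)   # width of the current block column
        r = n - k - nb       # rows below the diagonal block
        if k > 0:
            yield ("syrk",) + syrk_counts(nb, k)       # A11 -= A10 * A10^T
        yield ("potf2", nb ** 3 / 3.0, nb * nb)        # unblocked factorization of A11
        if r > 0:
            if k > 0:
                yield ("gemm",) + gemm_counts(r, nb, k)  # A21 -= A20 * A10^T
            yield ("trsm",) + trsm_counts(r, nb)         # A21 := A21 * inv(A11)^T


def model_cost(n, b, costs):
    """Sum the per-kernel linear costs over all calls of the factorization."""
    return sum(costs[name].per_flop * flops + costs[name].per_word * words
               for name, flops, words in cholesky_kernel_calls(n, b))


if __name__ == "__main__":
    # Placeholder values only; they are not measurements from any processor.
    placeholder = KernelCost(per_flop=1.0e-10, per_word=1.0e-9)
    costs = {name: placeholder for name in ("gemm", "trsm", "syrk", "potf2")}
    print(model_cost(4096, 256, costs))
```

Passing one cost table for time and another for energy yields the two estimates from the same call sequence, which mirrors the idea of assembling the factorization model from its building blocks.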
