Understanding HPC Benchmark Performance on Intel Broadwell and Cascade Lake Processors

Hardware platforms in high performance computing are constantly getting more complex to handle, even when considering multicore CPUs alone. Numerous performance-relevant features and configuration options in the hardware and the software environment are not even known to most application users or developers. Microbenchmarks, i.e., simple codes that probe a particular aspect of the hardware, can help to shed light on such issues, but only if they are well understood and if the results can be reconciled with known facts or performance models. The insight gained from microbenchmarks may then be applied to real applications for performance analysis or optimization. In this paper, we investigate two modern Intel x86 server CPU architectures in depth: Broadwell EP and Cascade Lake SP. We highlight relevant hardware configuration settings that can have a decisive impact on code performance and show how to properly measure on-chip and off-chip data transfer bandwidths. The new victim L3 cache of Cascade Lake and its advanced replacement policy receive due attention. Finally, we use DGEMM, sparse matrix-vector multiplication, and the HPCG benchmark to make a connection to relevant application scenarios.
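
As a rough illustration of why a bandwidth microbenchmark must be well understood before its numbers are trusted, the following sketch does the data-volume accounting for a triad-style kernel, a[i] = b[i] + s * c[i]. The choice of kernel, the function names, and the example numbers are assumptions made for illustration and are not taken from the paper. The point is that the reported bandwidth depends on whether the write-allocate transfer (the target cache line being read before it is overwritten) is counted.

```python
def triad_bytes_per_iteration(word_bytes=8, write_allocate=True):
    """Bytes moved per iteration of a[i] = b[i] + s * c[i].

    Hypothetical helper: counts two loads (b, c), one store (a), and
    optionally the write-allocate read of a's cache line.
    """
    loads = 2 * word_bytes
    store = word_bytes
    allocate = word_bytes if write_allocate else 0
    return loads + store + allocate


def bandwidth_gbs(n_elements, seconds, word_bytes=8, write_allocate=True):
    """Bandwidth in GB/s (10^9 bytes/s) for one sweep over n_elements."""
    volume = n_elements * triad_bytes_per_iteration(word_bytes, write_allocate)
    return volume / seconds / 1e9


if __name__ == "__main__":
    # Assumed example: 10^9 double-precision elements swept in 0.32 s.
    n, t = 10**9, 0.32
    print("counting write-allocate:", bandwidth_gbs(n, t, write_allocate=True))
    print("ignoring write-allocate:", bandwidth_gbs(n, t, write_allocate=False))
```

For the same measured runtime, the two accounting conventions differ by a factor of 4/3 (32 versus 24 bytes per iteration), which is the kind of discrepancy that has to be reconciled with a performance model. The runtime itself would have to come from a compiled, properly pinned kernel; timing this loop in interpreted Python would measure interpreter overhead rather than memory bandwidth.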
