Performance Optimizations of Recursive Electronic Structure Solvers targeting Multi-Core Architectures (LA-UR-20-26665)

As computing resources approach extreme scale, software optimization is becoming of paramount interest to scientific application developers who want to use all available on-node computing capability efficiently and thereby improve the science-per-watt metric. The scientific application of interest here is the Basic Matrix Library (BML), which provides a single interface for linear algebra operations frequently used in the Quantum Molecular Dynamics (QMD) community. A single interface implies an abstraction layer, which in turn suggests commonalities in the code base; any optimization or tuning introduced in the core of the code base can therefore improve the performance of the library as a whole. With that in mind, we survey the entire BML code base and extract common snippets of code in the form of micro-kernels. We introduce several optimization strategies into these micro-kernels: (1) strength reduction, (2) memory alignment for large arrays, (3) Non-Uniform Memory Access (NUMA) aware allocations to enforce data locality, and (4) appropriate thread affinity and bindings to enhance overall multi-threaded performance. After introducing these optimizations, we benchmark the micro-kernels and compare the run time before and after optimization on several target architectures. Finally, we use the results as a guide for propagating the optimization strategies into the BML code base. As a demonstration, we test the efficacy of these optimization strategies by comparing the baseline and optimized versions of the code.
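
To make the first two strategies concrete, the following is a minimal sketch in C of what a strength-reduced, alignment-aware micro-kernel might look like. It is not taken from the BML code base: the row-sum kernel, the function names (`alloc_aligned_doubles`, `is_aligned`, `row_sums_baseline`, `row_sums_reduced`) and the 64-byte `ALIGNMENT` value are all assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define ALIGNMENT 64 /* assumed cache-line size in bytes; hardware dependent */

/* Allocate `count` doubles on an ALIGNMENT-byte boundary.
 * C11 aligned_alloc expects the size to be a multiple of the alignment,
 * so the byte count is rounded up. Release the buffer with free(). */
double *alloc_aligned_doubles(size_t count)
{
    size_t bytes = count * sizeof(double);
    size_t padded = (bytes + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT;
    if (padded == 0)
        padded = ALIGNMENT;
    return aligned_alloc(ALIGNMENT, padded);
}

/* Return 1 if `p` sits on an ALIGNMENT-byte boundary, 0 otherwise. */
int is_aligned(const void *p)
{
    return ((uintptr_t)p % ALIGNMENT) == 0;
}

/* Baseline: walks the flat n*n array and performs one integer
 * division per element to recover the row index. */
void row_sums_baseline(const double *a, size_t n, double *sums)
{
    assert(a != NULL && sums != NULL);
    for (size_t i = 0; i < n; i++)
        sums[i] = 0.0;
    for (size_t k = 0; k < n * n; k++) {
        size_t row = k / n;
        sums[row] += a[k];
    }
}

/* Strength-reduced: the division is gone, and the row offset i * n
 * is replaced by a pointer that advances by n after each row. */
void row_sums_reduced(const double *a, size_t n, double *sums)
{
    assert(a != NULL && sums != NULL);
    const double *row = a;
    for (size_t i = 0; i < n; i++) {
        double s = 0.0;
        for (size_t j = 0; j < n; j++)
            s += row[j];
        sums[i] = s;
        row += n; /* addition in place of a multiply or divide */
    }
}
```

Both kernels add the elements of each row in the same order, so they should produce identical sums; only the index arithmetic differs. An optimizing compiler may perform part of this transformation on its own, so any gain has to be measured rather than assumed. NUMA-aware placement and thread binding are not shown here, since they depend on the threading runtime and the machine topology.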
