Memory and Parallelism Tuning Exploration using the LULESH Proxy Application

Current and planned computer systems present challenges for scientific programming. Memory capacity and bandwidth increasingly limit performance as floating-point capability grows through more cores per processor and wider vector units. Using the new hardware effectively requires exposing more parallelism while using relatively less memory. In this poster, we present how we tuned the Livermore Unstructured Lagrange Explicit Shock Hydrodynamics (LULESH) proxy application for on-node performance, resulting in 62% fewer memory reads, a 19% smaller memory footprint, 77% more floating-point operations vectorized, and serial sections that account for less than 0.1% of runtime. Tests show serial runtime reductions of up to 57% and parallel runtime reductions of up to 75%. We are applying these optimizations to GPUs and to a subset of ALE3D, the application from which the proxy application was derived. So far we have achieved up to a 1.9x speedup on GPUs and a 13% runtime reduction in the actual application for the same problem.
