Software methods for improvement of cache performance on supercomputer applications

Measurements of actual supercomputer cache performance have not previously been undertaken. PFC-Sim is a program-driven event-tracing facility that can simulate the data cache performance of very long programs. Because PFC-Sim simulates the cache concurrently with program execution, very long traces can be used; programs with traces in excess of 4 billion entries have been used to measure the performance of various cache structures. PFC-Sim was used to measure the cache performance of array references in RiCEPS, a benchmark set of supercomputer applications. Data cache hit ratios averaged 70% for a 16K cache and 91% for a 256K cache. Programs with very large working sets exhibit poor cache performance even with large caches. The measured hit ratios of individual references tend to be close to either 0% or 100%. By locating the references that miss, efforts to improve memory performance can focus on the references where improvement is possible.

The compiler can estimate the number of loop iterations that can execute without filling the cache, the overflow iteration. Combined with the dependence graph, the overflow iteration can be used to determine, at each reference, whether execution will result in hits or misses.

Program transformation can improve cache performance by reordering computation so that references to the same memory location occur closer together, thereby eliminating cache misses. Using the overflow iteration, the compiler can often perform this transformation automatically. Standard blocking transformations cannot be applied to many loop nests because they contain transformation-preventing dependences. Wavefront blocking allows any loop nest to be blocked, provided the components of its dependence vectors are bounded.

When cache misses cannot be eliminated, software prefetching can overlap the miss delays with computation. Software prefetching uses a special instruction, a cache load, to preload values into the cache. A cache load resembles a register load in structure, but it does not block computation; it only moves the addressed data into the cache, where a later register load will find it. The compiler can issue the cache load, on average, more than 100 cycles before the value is required, so cache misses can be serviced in parallel with computation.
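
The overflow iteration described above can be approximated as the first iteration at which the distinct data referenced so far no longer fits in the cache. The following is a minimal sketch of that estimate, not the thesis's actual analysis; the cache size, element size, and number of new elements touched per iteration are assumed inputs that a real compiler would derive from the target machine and the dependence graph.

    #include <stddef.h>

    /* Rough estimate of the overflow iteration: the first loop
     * iteration at which the data touched so far exceeds the
     * cache capacity.  All three parameters are assumptions a
     * compiler would supply for the target machine and loop.  */
    static size_t overflow_iteration(size_t cache_bytes,
                                     size_t elem_bytes,
                                     size_t new_elems_per_iter)
    {
        size_t bytes_per_iter = elem_bytes * new_elems_per_iter;
        return bytes_per_iter ? cache_bytes / bytes_per_iter
                              : (size_t)-1;  /* no growth: never overflows */
    }

    /* Example: a 64K cache, 8-byte doubles, 3 new elements per
     * iteration gives an overflow iteration of about 2730, so
     * reuse separated by more iterations than that will miss.  */
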
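A minimal sketch of the kind of reordering transformation described above, assuming a simple tiled matrix transpose in C; the problem size N and tile size BLOCK are hypothetical values that a compiler would choose from the overflow iteration for the target cache.

    #include <stddef.h>

    #define N     1024   /* hypothetical problem size                  */
    #define BLOCK 32     /* hypothetical tile size; a compiler would   */
                         /* derive it from the overflow iteration      */

    /* Untiled transpose: the column-order accesses to b[] stride
     * through memory and evict cache lines before they are reused. */
    void transpose_naive(double b[N][N], const double a[N][N])
    {
        for (size_t i = 0; i < N; i++)
            for (size_t j = 0; j < N; j++)
                b[j][i] = a[i][j];
    }

    /* Tiled transpose: each BLOCK x BLOCK tile of a[] and b[] fits
     * in cache, so references to the same lines are moved closer
     * together and many of the misses above become hits.           */
    void transpose_tiled(double b[N][N], const double a[N][N])
    {
        for (size_t ii = 0; ii < N; ii += BLOCK)
            for (size_t jj = 0; jj < N; jj += BLOCK)
                for (size_t i = ii; i < ii + BLOCK; i++)
                    for (size_t j = jj; j < jj + BLOCK; j++)
                        b[j][i] = a[i][j];
    }
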
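The wavefront idea can be illustrated on a loop nest whose dependences prevent ordinary rectangular blocking. The sketch below only shows the wavefront (anti-diagonal) execution order for a stencil where a[i][j] depends on a[i-1][j] and a[i][j-1]; the blocking step, which groups tiles along each wavefront, is omitted, and the stencil itself is an illustrative assumption rather than an example from the thesis.

    /* Iterate by wavefronts w = i + j over a stencil whose
     * dependences forbid rectangular tiling.  Both sources of
     * a[i][j] lie on wavefront w - 1, so every point on a given
     * wavefront can be visited (and tiled) independently.       */
    void wavefront_sweep(int n, double a[n][n])
    {
        for (int w = 2; w <= 2 * (n - 1); w++) {
            int ilo = (w - (n - 1) > 1)     ? w - (n - 1) : 1;
            int ihi = (w - 1     < n - 1)   ? w - 1       : n - 1;
            for (int i = ilo; i <= ihi; i++) {
                int j = w - i;
                a[i][j] = 0.5 * (a[i - 1][j] + a[i][j - 1]);
            }
        }
    }
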
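The cache load described above is a proposed instruction; a present-day analogue, used here purely for illustration, is the GCC/Clang __builtin_prefetch intrinsic. In this sketch the prefetch distance DIST is a hypothetical value standing in for the more-than-100-cycle lead time the compiler would compute.

    #include <stddef.h>

    #define DIST 16   /* hypothetical prefetch distance, in iterations */

    double dot(const double *x, const double *y, size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + DIST < n) {
                /* Non-blocking: request the lines needed DIST
                 * iterations from now, then keep computing.    */
                __builtin_prefetch(&x[i + DIST], 0 /* read */, 3);
                __builtin_prefetch(&y[i + DIST], 0, 3);
            }
            sum += x[i] * y[i];
        }
        return sum;
    }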