Performance characterization and validation of ASCI applications: A memory centric view

The performance and scalability of high-performance scientific applications on large-scale parallel machines depend more on the hierarchical memory subsystems of these machines than on the peak instruction rate of the processors employed. This dependence is likely to increase in the future. While single-processor performance may double every eighteen months, memory bandwidth increases by only 15% during the same period. In addition, distributed shared memory (DSM) architectures are now being implemented that extend the concept of single-processor cache hierarchies across an entire physically distributed multiprocessor machine. Machines that will be available to the Department of Energy's Accelerated Strategic Computing Initiative (ASCI) can have as many as 128 processors in a single DSM. Scalability of these machines to large numbers of processors is ultimately tied to memory hierarchy performance, which includes data migration policies and distributed cache coherence protocols. Investigations of the performance improvements of applications over time and across new generations of machines must explicitly account for the effects of memory performance. In this paper, the authors characterize application performance with a memory-centric view. The applications are a representative part of the ASCI workload. Using a simple Mean Value Analysis (MVA) strategy and observed performance data, they infer the contribution of each level in the memory system to the application's overall performance in cycles per instruction (CPI). Their empirical model accounts for the overlap of processor execution with memory accesses.
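
The abstract does not give the form of the model, so the following is only a minimal sketch of how a memory-centric CPI decomposition of this kind might look. It assumes a common additive form in which each memory level contributes (references per instruction) x (latency in cycles), scaled by the fraction of memory time not hidden behind processor execution. The names `MemoryLevel`, `cpi_breakdown`, and `infer_overlap`, the single global overlap factor, and all numeric values are hypothetical illustrations, not the authors' model or data.

```python
from dataclasses import dataclass


@dataclass
class MemoryLevel:
    """One level of the memory hierarchy (hypothetical parameterization)."""
    name: str
    refs_per_instruction: float  # references satisfied at this level, per instruction
    latency_cycles: float        # cycles needed to service one such reference


def cpi_breakdown(cpi_base, levels, overlap):
    """Return (per-level CPI contributions, total CPI).

    Assumed form: CPI = cpi_base + (1 - overlap) * sum(refs_i * latency_i),
    where overlap in [0, 1] is the fraction of memory time hidden by
    concurrent processor execution.
    """
    if not 0.0 <= overlap <= 1.0:
        raise ValueError("overlap must lie in [0, 1]")
    contributions = {
        lvl.name: (1.0 - overlap) * lvl.refs_per_instruction * lvl.latency_cycles
        for lvl in levels
    }
    return contributions, cpi_base + sum(contributions.values())


def infer_overlap(observed_cpi, cpi_base, levels):
    """Solve the assumed model for overlap, given an observed total CPI."""
    raw_memory_cpi = sum(l.refs_per_instruction * l.latency_cycles for l in levels)
    if raw_memory_cpi <= 0.0:
        raise ValueError("levels must contribute a positive memory CPI")
    return 1.0 - (observed_cpi - cpi_base) / raw_memory_cpi


# Hypothetical hierarchy: values are illustrative only, not measurements.
EXAMPLE_LEVELS = [
    MemoryLevel("L2 cache", refs_per_instruction=0.02, latency_cycles=10.0),
    MemoryLevel("local memory", refs_per_instruction=0.004, latency_cycles=80.0),
    MemoryLevel("remote memory", refs_per_instruction=0.001, latency_cycles=200.0),
]

if __name__ == "__main__":
    parts, total = cpi_breakdown(cpi_base=0.8, levels=EXAMPLE_LEVELS, overlap=0.25)
    for name, value in parts.items():
        print(f"{name}: {value:.3f} CPI")
    print(f"total: {total:.3f} CPI")
```

In this sketch, `infer_overlap` runs the same relation backwards: given a measured CPI and per-level reference counts, it recovers the overlap fraction that reconciles the two. A single overlap factor is a simplification; an empirical model could just as well use a separate factor per level.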