Optimizing performance on modern HPC systems: learning from simple kernel benchmarks

We discuss basic optimization and parallelization strategies for current cache-based microprocessors (Intel Itanium2, Intel Netburst and AMD64 variants) in single-CPU and shared memory environments. Using selected kernel benchmarks representing data intensive applications we focus on the effective bandwidths attainable, which is still suboptimal using current compilers. We stress the need for a subtle OpenMP implementation even for simple benchmark programs, to exploit the high aggregate memory bandwidth available nowadays on ccNUMA systems. If the quality of main memory access is the measure, classical vector systems such as the NEC SX6+ are still a class of their own and are able to sustain the performance level of in-cache operations of modern microprocessors even with arbitrarily large data sets.