Auto-Tuning for the Era of Relatively High Bandwidth Memory Architectures: A Discussion Based on an FDM Application

This paper focuses on 3D-stacked memory, which provides higher memory access bandwidth and, in the post-Moore era, shifts the route to performance from increasing computation to relatively high data movement performance. Exploiting this bandwidth requires rethinking conventional algorithms: algorithms designed to be greedy for computation (FLOPS) must be restructured into algorithms that are greedy for data movement (BYTES). Auto-tuning (AT) will remain a crucial technology in this era of BYTES. To support this argument, we take a memory-bound application based on the finite difference method (FDM) that has an AT facility and demonstrate the effectiveness of AT on a high bandwidth memory architecture. The evaluation platforms are the Fujitsu PRIMEHPC FX10 and the Fujitsu PRIMEHPC FX100, machines with similar architectures but different memory bandwidths. The performance evaluation reveals an important point: when the bytes-per-flop (B/F) ratio is greater than 0.475, a high-B/F kernel is faster than a low-B/F kernel. The speedup is not small, reaching up to 4.47x. This indicates that the current trend in code optimization will change in the era of relatively high bandwidth memory architectures.
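The B/F argument can be illustrated with a simplified roofline-style estimate. The sketch below is not the evaluation method used in this paper. It assumes that execution time is bounded by the slower of computation and memory traffic, and every number in it (the two machines and the two kernel variants) is hypothetical and chosen only to show the crossover. Only the 0.475 threshold comes from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Machine:
    name: str
    peak_gflops: float    # peak floating-point rate (GFLOP/s), hypothetical
    bandwidth_gbs: float  # memory bandwidth (GB/s), hypothetical

    @property
    def bytes_per_flop(self) -> float:
        """Machine B/F: bandwidth divided by peak compute rate."""
        return self.bandwidth_gbs / self.peak_gflops


@dataclass(frozen=True)
class Kernel:
    name: str
    gflop: float  # total floating-point work (GFLOP), hypothetical
    gbyte: float  # total memory traffic (GB), hypothetical

    @property
    def bytes_per_flop(self) -> float:
        """Kernel B/F: memory traffic per floating-point operation."""
        return self.gbyte / self.gflop


def estimated_time(kernel: Kernel, machine: Machine) -> float:
    """Roofline-style bound: the slower of compute time and memory time."""
    compute_time = kernel.gflop / machine.peak_gflops
    memory_time = kernel.gbyte / machine.bandwidth_gbs
    return max(compute_time, memory_time)


def faster_variant(kernels, machine: Machine) -> Kernel:
    """Return the kernel variant with the smallest estimated time."""
    return min(kernels, key=lambda k: estimated_time(k, machine))


# Two hypothetical machines with equal peak compute but different bandwidth.
low_bw = Machine("low-bandwidth", peak_gflops=1000.0, bandwidth_gbs=100.0)
high_bw = Machine("high-bandwidth", peak_gflops=1000.0, bandwidth_gbs=500.0)

# Two hypothetical variants of the same computation:
# one recomputes values (more flops, less traffic), one streams them from memory.
low_bf = Kernel("low_bf", gflop=10.0, gbyte=1.0)
high_bf = Kernel("high_bf", gflop=2.0, gbyte=2.5)
variants = [low_bf, high_bf]

for machine in (low_bw, high_bw):
    best = faster_variant(variants, machine)
    print(f"{machine.name}: machine B/F = {machine.bytes_per_flop:.3f}, "
          f"faster variant = {best.name}")
```

Under these assumed numbers, the low-B/F variant wins on the low-bandwidth machine, while the high-B/F variant wins once the machine B/F exceeds the 0.475 threshold. An AT facility can make this choice empirically by timing each variant on the target machine instead of relying on a model.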