Performance tuning on a single CPU remains an essential foundation for massively parallel applications in the upcoming exascale era if they are to achieve a high fraction of peak performance. In this paper, we investigate the room for performance improvement available through memory layout optimization. The target application is a stencil computation, and we use the roofline model as its performance model. Analysis of the application with the roofline model and with a performance analysis tool we have been developing suggests that applying padding in the source code can improve performance. We therefore explore memory layouts that improve application performance by a brute-force search over 1,000 patterns generated at random from the possible padding parameters. Measured on a single node, the application using the best memory layout found runs 4.3 times faster than the original.
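The two ideas named above, the roofline bound and a brute-force search over randomly generated padding patterns, can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the function names, the per-array padding representation, and the `measure` callback are hypothetical, and the only formula assumed is the standard roofline bound, attainable performance = min(peak performance, memory bandwidth × arithmetic intensity).

```python
import random


def roofline_gflops(peak_gflops, bandwidth_gbs, intensity_flops_per_byte):
    """Roofline bound: performance is capped by compute peak or by memory traffic."""
    return min(peak_gflops, bandwidth_gbs * intensity_flops_per_byte)


def sample_padding_patterns(num_arrays, max_pad, count, seed=0):
    """Draw `count` distinct random padding patterns.

    Each pattern is a tuple with one padding size (in elements) per array.
    This representation is an assumption made for illustration only.
    """
    total = (max_pad + 1) ** num_arrays
    if count > total:
        raise ValueError("more patterns requested than exist")
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    patterns = set()
    while len(patterns) < count:
        patterns.add(tuple(rng.randint(0, max_pad) for _ in range(num_arrays)))
    return sorted(patterns)


def best_pattern(patterns, measure):
    """Brute-force search: evaluate every pattern and keep the one with the lowest cost."""
    return min(patterns, key=measure)
```

In a real experiment, `measure` would rebuild the stencil code with the given padding and return its measured run time. Here it is left as a caller-supplied function, so the search logic can be checked with a toy cost function before any timing runs are made.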