Data-centric combinatorial optimization of parallel code

Memory performance is essential to exploiting the massive parallelism of GPUs, which has motivated recent work on GPU cache modeling. This paper presents a new data-centric approach to modeling the performance of a system with heterogeneous memory resources. The model is composable: after profiling the execution just once, it can predict how performance changes when data is placed differently.
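To make the idea of composability concrete, the following is a minimal sketch, not the paper's model. It assumes that a single profiling run yields a per-array access count and size, and that the cost of a placement is simply the sum of each array's accesses multiplied by a fixed latency for the memory it is assigned to. The latency values, the shared-memory capacity, the memory names, and the functions `placement_cost` and `best_placement` are all hypothetical and chosen only for illustration; cache effects and contention are ignored.

```python
from itertools import product

# Hypothetical per-access latencies in cycles; illustrative values, not measurements.
LATENCY = {"global": 400, "texture": 200, "shared": 30}

# Hypothetical capacity limit for the fastest memory, in bytes.
SHARED_CAPACITY = 48 * 1024


def placement_cost(profile, placement):
    """Compose the cost of a placement from per-array profile data.

    The profile is collected once; any placement is then evaluated
    by summing independent per-array terms, without re-profiling.
    """
    return sum(profile[name]["accesses"] * LATENCY[mem]
               for name, mem in placement.items())


def best_placement(profile):
    """Exhaustively search all placements that fit in shared memory."""
    names = sorted(profile)
    best = None
    for choice in product(LATENCY, repeat=len(names)):
        placement = dict(zip(names, choice))
        shared_bytes = sum(profile[n]["bytes"]
                           for n, m in placement.items() if m == "shared")
        if shared_bytes > SHARED_CAPACITY:
            continue  # infeasible: exceeds the assumed shared-memory capacity
        cost = placement_cost(profile, placement)
        if best is None or cost < best[0]:
            best = (cost, placement)
    return best


# Made-up profile of three arrays, standing in for one profiling run.
profile = {
    "A": {"accesses": 1000, "bytes": 40 * 1024},
    "B": {"accesses": 5000, "bytes": 32 * 1024},
    "C": {"accesses": 200, "bytes": 4 * 1024},
}

cost, placement = best_placement(profile)
print(cost, placement)
```

Exhaustive search grows exponentially with the number of arrays, so this sketch is only practical for a handful of data objects. The additive cost is also a simplification: real memory systems have caches whose behavior depends on how the accesses of different arrays interleave.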
