PORPLE: An Extensible Optimizer for Portable Data Placement on GPU
Dong Li | Bo Wu | Guoyang Chen | Xipeng Shen