Efficient utilization of GPGPU cache hierarchy
[1] Naga K. Govindaraju,et al. Mars: A MapReduce Framework on graphics processors , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).
[2] Hyesoon Kim,et al. TAP: A TLP-aware cache management policy for a CPU-GPU heterogeneous architecture , 2012, IEEE International Symposium on High-Performance Computer Architecture.
[3] André Seznec,et al. A case for two-way skewed-associative caches , 1993, ISCA '93.
[4] José González,et al. The design and performance of a conflict-avoiding cache , 1997, Proceedings of 30th Annual International Symposium on Microarchitecture.
[5] Mahmut T. Kandemir,et al. OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance , 2013, ASPLOS '13.
[6] Emmett Kilgariff,et al. Fermi GF100 GPU Architecture , 2011, IEEE Micro.
[7] Jaejin Lee,et al. Using prime numbers for cache indexing to eliminate conflict misses , 2004, 10th International Symposium on High Performance Computer Architecture (HPCA'04).
[8] Norman P. Jouppi,et al. Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0 , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).
[9] B. Ramakrishna Rau,et al. Pseudo-randomly interleaved memory , 1991, ISCA '91.
[10] James C. Hoe,et al. Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs? , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.
[11] Mateo Valero,et al. Eliminating cache conflict misses through XOR-based placement functions , 1997, ICS '97.
[12] Henry Wong,et al. Analyzing CUDA workloads using a detailed GPU simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.
[13] Duncan H. Lawrie,et al. The Prime Memory System for Array Access , 1982, IEEE Transactions on Computers.
[14] William J. Dally,et al. GPUs and the Future of Parallel Computing , 2011, IEEE Micro.
[15] Mahmut T. Kandemir,et al. Neither more nor less: Optimizing thread-level parallelism for GPGPUs , 2013, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques.
[16] Alan Jay Smith,et al. Evaluating Associativity in CPU Caches , 1989, IEEE Trans. Computers.
[17] Collin McCurdy,et al. The Scalable Heterogeneous Computing (SHOC) benchmark suite , 2010, GPGPU-3.
[18] Lifan Xu,et al. Auto-tuning a high-level language targeted to GPU codes , 2012, 2012 Innovative Parallel Computing (InPar).
[19] Xuhao Chen,et al. Adaptive Cache Bypass and Insertion for Many-core Accelerators , 2014, MES '14.
[20] Mikko H. Lipasti,et al. Adaptive Cache and Concurrency Allocation on GPGPUs , 2015, IEEE Computer Architecture Letters.
[21] Kevin Skadron,et al. A performance study of general-purpose applications on graphics processors using CUDA , 2008, J. Parallel Distributed Comput.
[22] John D. Owens,et al. GPU Computing , 2008, Proceedings of the IEEE.
[23] Dong Li. Orchestrating Thread Scheduling and Cache Management to Improve Memory System Throughput in Throughput Processors , 2014 .
[24] Yale N. Patt,et al. The V-Way cache: demand-based associativity via global replacement , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).
[25] Mike O'Connor,et al. Cache-Conscious Wavefront Scheduling , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.
[26] Kevin Skadron,et al. Accelerating Compute-Intensive Applications with GPUs and FPGAs , 2008, 2008 Symposium on Application Specific Processors.
[27] Mateo Valero,et al. Improving Cache Management Policies Using Dynamic Reuse Distances , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.
[28] Xuhao Chen,et al. Adaptive Cache Management for Energy-Efficient GPU Computing , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.
[29] Tor M. Aamodt,et al. Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).
[30] David T. Harper,et al. Vector Access Performance in Parallel Memories Using a Skewed Storage Scheme , 1987, IEEE Transactions on Computers.
[31] Mike O'Connor,et al. Divergence-Aware Warp Scheduling , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[32] John Kim,et al. Improving GPGPU resource utilization through alternative thread block scheduling , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).
[33] William J. Dally,et al. The GPU Computing Era , 2010, IEEE Micro.
[34] Jie Cheng,et al. Programming Massively Parallel Processors. A Hands-on Approach , 2010, Scalable Comput. Pract. Exp.
[35] Kevin Skadron,et al. Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).
[36] Margaret Martonosi,et al. MRPB: Memory request prioritization for massively parallel processors , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).