Building Heterogeneous Unified Virtual Memories (UVMs) without the Overhead
暂无分享,去创建一个
Erik Hagersten | Stefanos Kaxiras | Alberto Ros | Konstantinos Koukos | Erik Hagersten | S. Kaxiras | Alberto Ros | K. Koukos
[1] Alan L. Cox,et al. TreadMarks: shared memory computing on networks of workstations , 1996 .
[2] Nam Sung Kim,et al. GPUWattch: enabling energy optimizations in GPGPUs , 2013, ISCA.
[3] Stefanos Kaxiras,et al. A new perspective for efficient virtual-cache coherence , 2013, ISCA.
[4] Antonio Robles,et al. Efficient TLB-Based Detection of Private Pages in Chip Multiprocessors , 2016, IEEE Transactions on Parallel and Distributed Systems.
[5] Babak Falsafi,et al. Reactive NUCA: near-optimal block placement and replication in distributed caches , 2009, ISCA '09.
[6] Luiz André Barroso,et al. Piranha: a scalable architecture based on single-chip multiprocessing , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).
[7] David A. Wood,et al. Heterogeneous system coherence for integrated CPU-GPU systems , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[8] Balaram Sinharoy,et al. POWER7: IBM's next generation server processor , 2010, 2009 IEEE Hot Chips 21 Symposium (HCS).
[9] Somayeh Sardashti,et al. The gem5 simulator , 2011, CARN.
[10] Christina Freytag,et al. Using Mpi Portable Parallel Programming With The Message Passing Interface , 2016 .
[11] Jung Ho Ahn,et al. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[12] Stefanos Kaxiras,et al. Turning Centralized Coherence and Distributed Critical-Section Execution on their Head: A New Approach for Scalable Distributed Shared Memory , 2015, HPDC.
[13] Kevin Skadron,et al. Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).
[14] David A. Wood,et al. Heterogeneous-race-free memory models , 2014, ASPLOS.
[15] David A. Wood,et al. Supporting x86-64 address translation for 100s of GPU lanes , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).
[16] Wen-mei W. Hwu,et al. Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing , 2012 .
[17] M. Hill,et al. Weak ordering-a new definition , 1990, [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture.
[18] Timothy G. Mattson,et al. OpenCL Programming Guide , 2011 .
[19] David A. Wood,et al. QuickRelease: A throughput-oriented approach to release consistency on GPUs , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).
[20] Daniel J. Sorin,et al. Exploring memory consistency for massively-threaded throughput-oriented processors , 2013, ISCA.
[21] Mike O'Connor,et al. Cache coherence for GPU architectures , 2013, 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA).
[22] Antonio Robles,et al. Increasing the effectiveness of directory caches by deactivating coherence for private memory blocks , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).
[23] Onur Mutlu,et al. The Dirty-Block Index , 2014, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA).
[24] Stefanos Kaxiras,et al. Hierarchical private/shared classification: The key to simple and efficient coherence for clustered cache hierarchies , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).
[25] Scott B. Baden,et al. Redefining the Role of the CPU in the Era of CPU-GPU Integration , 2012, IEEE Micro.
[26] David Black-Schaffer,et al. TLC: A tag-less cache for reducing dynamic first level cache energy , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[27] Bronis R. de Supinski,et al. Experiences with Achieving Portability across Heterogeneous Architectures , 2011 .
[28] Stefanos Kaxiras,et al. Callback: Efficient synchronization without invalidation with a directory just for spin-waiting , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).
[29] David A. Wood,et al. gem5-gpu: A Heterogeneous CPU-GPU Simulator , 2015, IEEE Computer Architecture Letters.
[30] Norman P. Jouppi,et al. CACTI 6.0: A Tool to Model Large Caches , 2009 .
[31] Stefanos Kaxiras,et al. Complexity-effective multicore coherence , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).
[32] Nikolaos Hardavellas,et al. Cashmere-VLM: Remote memory paging for software distributed shared memory , 1999, Proceedings 13th International Parallel Processing Symposium and 10th Symposium on Parallel and Distributed Processing. IPPS/SPDP 1999.
[33] Henry Wong,et al. Analyzing CUDA workloads using a detailed GPU simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.