Cache coherence for GPU architectures
Mike O'Connor | Tor M. Aamodt | Wilson W. L. Fung | Inderpreet Singh | Arrvindh Shriraman
[1] Andrew B. Kahng,et al. ORION 2.0: A fast and accurate NoC power and area model for early-stage design space exploration , 2009, 2009 Design, Automation & Test in Europe Conference & Exhibition.
[2] William J. Dally,et al. GPUs and the Future of Parallel Computing , 2011, IEEE Micro.
[3] Kevin Skadron,et al. Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).
[4] Christoforos E. Kozyrakis,et al. SCD: A scalable coherence directory with flexible sharer set encoding , 2012, IEEE International Symposium on High-Performance Computer Architecture.
[5] Wen-Hann Wang,et al. On the inclusion properties for multi-level cache hierarchies , 1988, ISCA '88.
[6] Jonathan Chang,et al. A 45 nm 8-Core Enterprise Xeon® Processor , 2010, IEEE J. Solid State Circuits.
[7] Michel Dubois,et al. Verification techniques for cache coherence protocols , 1997, CSUR.
[8] David L Weaver,et al. The SPARC architecture manual : version 9 , 1994 .
[9] Srinivas Devadas,et al. Library Cache Coherence , 2011 .
[10] Keshav Pingali,et al. A GPU implementation of inclusion-based points-to analysis , 2012, PPoPP '12.
[11] Andreas Moshovos,et al. Demystifying GPU microarchitecture through microbenchmarking , 2010, 2010 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS).
[12] Robert Sims,et al. Alpha architecture reference manual , 1992 .
[13] Mike O'Connor,et al. Characterizing and evaluating a key-value store application on heterogeneous CPU-GPU systems , 2012, 2012 IEEE International Symposium on Performance Analysis of Systems & Software.
[14] Sanjay J. Patel,et al. WAYPOINT: scaling coherence to thousand-core architectures , 2010, PACT '10.
[15] Milo M. K. Martin,et al. Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset , 2005, CARN.
[16] D. Lenoski,et al. The SGI Origin: A ccNUMA Highly Scalable Server , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.
[17] M. Hill,et al. Weak ordering - a new definition , 1990, [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture.
[18] Sarita V. Adve,et al. Shared Memory Consistency Models: A Tutorial , 1996, Computer.
[19] Andrew S. Grimshaw,et al. Scalable GPU graph traversal , 2012, PPoPP '12.
[20] Sarita V. Adve,et al. DeNovo: Rethinking the Memory Hierarchy for Disciplined Parallelism , 2011, 2011 International Conference on Parallel Architectures and Compilation Techniques.
[21] Jeremy Manson,et al. The Java memory model , 2005, POPL '05.
[22] Vijayalakshmi Srinivasan,et al. SPATL: Honey, I Shrunk the Coherence Directory , 2011, 2011 International Conference on Parallel Architectures and Compilation Techniques.
[23] Jonathan Chang,et al. A 45 nm 8-Core Enterprise Xeon® Processor , 2009, IEEE Journal of Solid-State Circuits.
[24] Philippas Tsigas,et al. On dynamic load balancing on graphics processors , 2008, GH '08.
[25] Hans-Juergen Boehm,et al. Foundations of the C++ concurrency memory model , 2008, PLDI '08.
[26] Henry Wong,et al. Analyzing CUDA workloads using a detailed GPU simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.
[27] Kenneth B. Kent,et al. The VTR project: architecture and CAD for FPGAs from verilog to routing , 2012, FPGA '12.
[28] Kunle Olukotun,et al. Accelerating CUDA graph algorithms at maximum warp , 2011, PPoPP '11.
[29] Srinivas Devadas,et al. Memory coherence in the age of multicores , 2011, 2011 IEEE 29th International Conference on Computer Design (ICCD).
[30] Alaa R. Alameldeen,et al. Timestamp snooping: an approach for extending SMPs , 2000, SIGP.
[31] Somayeh Sardashti,et al. The gem5 simulator , 2011, CARN.
[32] John Giacomoni,et al. FastForward for efficient pipeline parallelism: a cache-optimized concurrent lock-free queue , 2008, PPoPP.
[33] Sang Lyul Min,et al. Design and Analysis of a Scalable Cache Coherence Scheme Based on Clocks and Timestamps , 1992, IEEE Trans. Parallel Distributed Syst..
[34] Rami G. Melhem,et al. A timestamp-based selective invalidation scheme for multiprocessor cache coherence , 1996, Proceedings of the 1996 ICPP Workshop on Challenges for Parallel Processing.
[35] Anoop Gupta,et al. Memory consistency and event ordering in scalable shared-memory multiprocessors , 1990, ISCA '90.
[36] Milo M. K. Martin,et al. Token Coherence: decoupling performance and correctness , 2003, 30th Annual International Symposium on Computer Architecture, 2003. Proceedings..
[37] Kunle Olukotun,et al. Niagara: a 32-way multithreaded Sparc processor , 2005, IEEE Micro.
[38] Eric M. Schwarz,et al. IBM POWER6 microarchitecture , 2007, IBM J. Res. Dev..
[39] David A. Wood,et al. A Primer on Memory Consistency and Cache Coherence , 2012, Synthesis Lectures on Computer Architecture.
[40] David Seal,et al. ARM Architecture Reference Manual , 2001 .
[41] John R. Heath,et al. Coherency Hub Design for Multisocket Sun Servers with CoolThreads Technology , 2009, IEEE Micro.
[42] Sanjay J. Patel,et al. Cohesion: a hybrid memory model for accelerators , 2010, ISCA.
[43] Stefanos Kaxiras,et al. Complexity-effective multicore coherence , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).
[44] Cathy May,et al. The PowerPC Architecture: A Specification for a New Family of RISC Processors , 1994 .
[45] Keshav Pingali,et al. An Efficient CUDA Implementation of the Tree-Based Barnes Hut n-Body Algorithm , 2011 .
[46] Wu-chun Feng,et al. Inter-block GPU communication via fast barrier synchronization , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).
[47] P. J. Narayanan,et al. CUDA cuts: Fast graph cuts on the GPU , 2008, 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.
[48] Niraj K. Jha,et al. GARNET: A detailed on-chip network model inside a full-system simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.
[49] Milo M. K. Martin,et al. Why on-chip cache coherence is here to stay , 2012, Commun. ACM.
[50] David A. Wood,et al. Dynamic self-invalidation: reducing coherence overhead in shared-memory multiprocessors , 1995, Proceedings 22nd Annual International Symposium on Computer Architecture.
[51] Pat Conway,et al. The AMD Opteron Northbridge Architecture , 2007, IEEE Micro.