Efficient Sequential Consistency in GPUs via Relativistic Cache Coherence
[1] Josep Torrellas,et al. BulkSC: bulk enforcement of sequential consistency , 2007, ISCA '07.
[2] Srinivas Devadas,et al. Tardis 2.0: Optimized time traveling coherence for relaxed consistency models , 2016, 2016 International Conference on Parallel Architecture and Compilation Techniques (PACT).
[3] Thomas F. Wenisch,et al. InvisiFence: performance-transparent memory ordering in conventional multiprocessors , 2009, ISCA '09.
[4] Bob Bentley,et al. Validating the Intel(R) Pentium(R) 4 microprocessor , 2001, Proceedings of the 38th Design Automation Conference (IEEE Cat. No.01CH37232).
[5] Randy H. Katz,et al. Verifying a multiprocessor cache controller using random test generation , 1990, IEEE Design & Test of Computers.
[6] Sarita V. Adve,et al. DeNovo: Rethinking the Memory Hierarchy for Disciplined Parallelism , 2011, 2011 International Conference on Parallel Architectures and Compilation Techniques.
[7] Somayeh Sardashti,et al. The gem5 simulator , 2011, CARN.
[8] Babak Falsafi,et al. Speculative sequential consistency with little custom storage , 2002, Proceedings.International Conference on Parallel Architectures and Compilation Techniques.
[9] Keshav Pingali,et al. An Efficient CUDA Implementation of the Tree-Based Barnes Hut n-Body Algorithm , 2011 .
[10] Alaa R. Alameldeen,et al. Timestamp snooping: an approach for extending SMPs , 2000, SIGP.
[11] Anoop Gupta,et al. Memory consistency and event ordering in scalable shared-memory multiprocessors , 1990, ISCA '90.
[12] Anoop Gupta,et al. Two Techniques to Enhance the Performance of Memory Consistency Models , 1991, ICPP.
[13] Peter Sewell,et al. A Better x86 Memory Model: x86-TSO , 2009, TPHOLs.
[14] Srinivas Devadas,et al. TARDIS: Timestamp based Coherence Algorithm for Distributed Shared Memory , 2015, ArXiv.
[15] Alan J. Hu,et al. Protocol verification as a hardware design aid , 1992, Proceedings 1992 IEEE International Conference on Computer Design: VLSI in Computers & Processors.
[16] T. N. Vijaykumar,et al. Is SC + ILP = RC? , 1999, ISCA.
[17] Rajiv Gupta,et al. Efficient sequential consistency via conflict ordering , 2012, ASPLOS XVII.
[18] Leslie Lamport,et al. How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs , 1979, IEEE Transactions on Computers.
[19] Srinivas Devadas,et al. Library Cache Coherence , 2011 .
[20] David L Weaver,et al. The SPARC architecture manual : version 9 , 1994 .
[21] David A. Wood,et al. Heterogeneous system coherence for integrated CPU-GPU systems , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[22] Eric M. Schwarz,et al. IBM POWER6 microarchitecture , 2007, IBM J. Res. Dev..
[23] Satish Narayanasamy,et al. zFENCE: Data-less Coherence for Efficient Fences , 2015, ICS.
[24] Balaram Sinharoy,et al. POWER5 system microarchitecture , 2005, IBM J. Res. Dev..
[25] Sarita V. Adve,et al. Efficient GPU synchronization without scopes: Saying no to complex consistency models , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[26] Leslie Lamport,et al. Time, clocks, and the ordering of events in a distributed system , 1978, CACM.
[27] Niraj K. Jha,et al. In-Network Snoop Ordering (INSO): Snoopy coherence on unordered interconnects , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.
[28] Satish Narayanasamy,et al. End-to-end sequential consistency , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).
[29] David A. Wood,et al. A Primer on Memory Consistency and Cache Coherence , 2012, Synthesis Lectures on Computer Architecture.
[30] Philippas Tsigas,et al. On dynamic load balancing on graphics processors , 2008, GH '08.
[31] Srinivas Devadas,et al. A Proof of Correctness for the Tardis Cache Coherence Protocol , 2015, ArXiv.
[32] Sebastian Burckhardt,et al. Verifying Safety of a Token Coherence Implementation by Parametric Compositional Refinement , 2005, VMCAI.
[33] Michel Dubois,et al. Scalable Shared Memory Multiprocessors , 1992, Springer US.
[34] Niraj K. Jha,et al. GARNET: A detailed on-chip network model inside a full-system simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.
[35] Satish Narayanasamy,et al. Efficiently enforcing strong memory ordering in GPUs , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[36] Wu-chun Feng,et al. Inter-block GPU communication via fast barrier synchronization , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).
[37] Ganesh Gopalakrishnan,et al. GPU Concurrency: Weak Behaviours and Programming Assumptions , 2015, ASPLOS.
[38] David A. Wood,et al. QuickRelease: A throughput-oriented approach to release consistency on GPUs , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).
[39] Sarita V. Adve,et al. Using speculative retirement and larger instruction windows to narrow the performance gap between memory consistency models , 1997, SPAA '97.
[40] Hans-Juergen Boehm,et al. Foundations of the C++ concurrency memory model , 2008, PLDI '08.
[41] Henry Wong,et al. Analyzing CUDA workloads using a detailed GPU simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.
[42] Kenneth B. Kent,et al. The VTR project: architecture and CAD for FPGAs from verilog to routing , 2012, FPGA '12.
[43] Michel Dubois,et al. Verifying Distributed Directory-Based Cache Coherence Protocols: S3.mp, a Case Study , 1995, Euro-Par.
[44] Wenzhi Chen,et al. Efficient Timestamp-Based Cache Coherence Protocol for Many-Core Architectures , 2016, ICS.
[45] Andreas Moshovos,et al. Demystifying GPU microarchitecture through microbenchmarking , 2010, 2010 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS).
[46] Jie Cheng,et al. CUDA by Example: An Introduction to General-Purpose GPU Programming , 2010, Scalable Comput. Pract. Exp..
[47] Mikko H. Lipasti,et al. Atomic SC for simple in-order processors , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).
[48] Kenneth C. Yeager. The Mips R10000 superscalar microprocessor , 1996, IEEE Micro.
[49] Jade Alglave,et al. Understanding POWER multiprocessors , 2011, PLDI '11.
[50] Thomas F. Wenisch,et al. Mechanisms for store-wait-free multiprocessors , 2007, ISCA '07.
[51] Kunle Olukotun,et al. Programming with transactional coherence and consistency (TCC) , 2004, ASPLOS XI.
[52] Snehasish Kumar,et al. Fusion: Design tradeoffs in coherent cache hierarchies for accelerators , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).
[53] Albert Meixner,et al. Dynamic Verification of Memory Consistency in Cache-Coherent Multithreaded Computer Architectures , 2009, IEEE Transactions on Dependable and Secure Computing.
[54] David A. Wood,et al. Heterogeneous-race-free memory models , 2014, ASPLOS.
[55] Somesh Jha,et al. Verification of the Futurebus+ cache coherence protocol , 1993, Formal Methods Syst. Des..
[56] M. Hill,et al. Weak ordering-a new definition , 1990, [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture.
[57] Mikko H. Lipasti,et al. The complexity of verifying memory coherence , 2003, SPAA '03.
[58] Jeremy Manson,et al. The Java memory model , 2005, POPL '05.
[59] Michel Dubois,et al. Memory access buffering in multiprocessors , 1998, ISCA '98.
[60] Andrew B. Kahng,et al. ORION 2.0: A fast and accurate NoC power and area model for early-stage design space exploration , 2009, 2009 Design, Automation & Test in Europe Conference & Exhibition.
[61] Kevin Skadron,et al. Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).
[62] Daniel J. Sorin,et al. Exploring memory consistency for massively-threaded throughput-oriented processors , 2013, ISCA.
[63] Mike O'Connor,et al. Cache coherence for GPU architectures , 2013, 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA).