Efficiently enforcing strong memory ordering in GPUs
[1] Josep Torrellas,et al. BulkSC: bulk enforcement of sequential consistency , 2007, ISCA '07.
[2] Andreas Moshovos,et al. Demystifying GPU microarchitecture through microbenchmarking , 2010, 2010 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS).
[3] Mikko H. Lipasti,et al. Atomic SC for simple in-order processors , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).
[4] Scott A. Mahlke,et al. Mascar: Speeding up GPU warps by reducing memory pitstops , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).
[5] Mike O'Connor,et al. Characterizing and evaluating a key-value store application on heterogeneous CPU-GPU systems , 2012, 2012 IEEE International Symposium on Performance Analysis of Systems & Software.
[6] Milo M. K. Martin,et al. Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset , 2005, CARN.
[7] Mike O'Connor,et al. Cache-Conscious Wavefront Scheduling , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.
[8] M. Hill,et al. Weak ordering-a new definition , 1990, [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture.
[9] Sarita V. Adve,et al. An evaluation of memory consistency models for shared-memory systems with ILP processors , 1996, ASPLOS VII.
[10] Michel Dubois,et al. Memory access buffering in multiprocessors , 1998, ISCA '98.
[11] Satish Narayanasamy,et al. End-to-end sequential consistency , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).
[12] Henry Wong,et al. Analyzing CUDA workloads using a detailed GPU simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.
[13] Mike O'Connor,et al. Divergence-Aware Warp Scheduling , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[14] Kevin Skadron,et al. Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).
[15] Kunle Olukotun,et al. Programming with transactional coherence and consistency (TCC) , 2004, ASPLOS XI.
[16] Margaret Martonosi,et al. MRPB: Memory request prioritization for massively parallel processors , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).
[17] Daniel J. Sorin,et al. Exploring memory consistency for massively-threaded throughput-oriented processors , 2013, ISCA.
[18] Mike O'Connor,et al. Cache coherence for GPU architectures , 2013, 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA).
[19] Muli Ben-Yehuda,et al. IOMMU: strategies for mitigating the IOTLB bottleneck , 2010, ISCA'10.
[20] Tom R. Halfhill. NVIDIA's Next-Generation CUDA Compute and Graphics Architecture, Code-Named Fermi, Adds Muscle for Parallel Processing , 2009 .
[21] Babak Falsafi,et al. Reactive NUCA: near-optimal block placement and replication in distributed caches , 2009, ISCA '09.
[22] Наталія Ігорівна Муліна,et al. Programming language C , 2013 .
[23] Mark D. Hill,et al. Multiprocessors Should Support Simple Memory-Consistency Models , 1998, Computer.
[24] Thomas F. Wenisch,et al. Mechanisms for store-wait-free multiprocessors , 2007, ISCA '07.
[25] Dennis Shasha,et al. Efficient and correct execution of parallel programs that share memory , 1988, TOPLAS.
[26] Mahmut T. Kandemir,et al. OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance , 2013, ASPLOS '13.
[27] Niraj K. Jha,et al. GARNET: A detailed on-chip network model inside a full-system simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.
[28] Ganesh Gopalakrishnan,et al. GPU Concurrency: Weak Behaviours and Programming Assumptions , 2015, ASPLOS.
[29] Happy Gogoi. PROGRAMMING LANGUAGE C , 2015 .
[30] Somayeh Sardashti,et al. The gem5 simulator , 2011, CARN.
[31] Babak Falsafi,et al. Speculative sequential consistency with little custom storage , 2002, Proceedings. International Conference on Parallel Architectures and Compilation Techniques.
[32] Satish Narayanasamy,et al. zFENCE: Data-less Coherence for Efficient Fences , 2015, ICS.
[33] David A. Wood,et al. Heterogeneous-race-free memory models , 2014, ASPLOS.
[34] David A. Wood,et al. Supporting x86-64 address translation for 100s of GPU lanes , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).
[35] Gil Neiger,et al. Intel® Virtualization Technology for Directed I/O , 2006 .
[36] Sarita V. Adve,et al. Using speculative retirement and larger instruction windows to narrow the performance gap between memory consistency models , 1997, SPAA '97.
[37] Dennis Shasha,et al. Efficient and correct execution of parallel programs that share memory , 1988 .
[38] Aaftab Munshi,et al. The OpenCL specification , 2009, 2009 IEEE Hot Chips 21 Symposium (HCS).
[39] Abhishek Bhattacharjee,et al. Architectural support for address translation on GPUs: designing memory management units for CPU/GPUs with unified address spaces , 2014, ASPLOS.
[40] T. N. Vijaykumar,et al. Is SC + ILP = RC? , 1999, ISCA.
[41] Rajiv Gupta,et al. Efficient sequential consistency via conflict ordering , 2012, ASPLOS XVII.
[42] Thomas F. Wenisch,et al. InvisiFence: performance-transparent memory ordering in conventional multiprocessors , 2009, ISCA '09.
[43] Anoop Gupta,et al. Two Techniques to Enhance the Performance of Memory Consistency Models , 1991, ICPP.
[44] Antonio Robles,et al. Increasing the effectiveness of directory caches by deactivating coherence for private memory blocks , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).
[45] Lifan Xu,et al. Auto-tuning a high-level language targeted to GPU codes , 2012, 2012 Innovative Parallel Computing (InPar).