A Software-Managed Approach to Die-Stacked DRAM
暂无分享,去创建一个
[1] Daniel J. Sorin,et al. UNified Instruction/Translation/Data (UNITD) coherence: One protocol to rule them all , 2010, HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture.
[2] Margaret Martonosi,et al. Shared last-level TLBs for chip multiprocessors , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.
[3] Carla Schlatter Ellis,et al. Evaluation of NUMA Memory Management Through Modeling and Measurements , 1992, IEEE Trans. Parallel Distributed Syst..
[4] Kern Koh,et al. Lazy TLB consistency for large-scale multiprocessors , 1997, Proceedings of IEEE International Symposium on Parallel Algorithms Architecture Synthesis.
[5] Douglas J. Joseph,et al. Prefetching Using Markov Predictors , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.
[6] James E. Smith,et al. Data Cache Prefetching Using a Global History Buffer , 2004, 10th International Symposium on High Performance Computer Architecture (HPCA'04).
[7] Norman P. Jouppi,et al. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers , 1990, [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture.
[8] Hsien-Hsin S. Lee,et al. An optimized 3D-stacked memory architecture by exploiting excessive, high-density TSV bandwidth , 2010, HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture.
[9] Margaret Martonosi,et al. Inter-core cooperative TLB for chip multiprocessors , 2010, ASPLOS XV.
[10] Alan L. Cox,et al. TreadMarks: Distributed Shared Memory on Standard Workstations and Operating Systems , 1994, USENIX Winter.
[11] Alan Jay Smith,et al. Sequential Program Prefetching in Memory Hierarchies , 1978, Computer.
[12] Mark D. Hill,et al. Efficiently enabling conventional block sizes for very large die-stacked DRAM caches , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[13] Sally A. McKee,et al. Hitting the memory wall: implications of the obvious , 1995, CARN.
[14] Gabriel H. Loh,et al. A Mostly-Clean DRAM Cache for Effective Hit Speculation and Self-Balancing Dispatch , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.
[15] Li Zhao,et al. Exploring DRAM cache architectures for CMP server platforms , 2007, 2007 25th International Conference on Computer Design.
[16] Babak Falsafi,et al. Die-stacked DRAM caches for servers: hit ratio, latency, or bandwidth? have it all with footprint cache , 2013, ISCA.
[17] Jean-Loup Baer,et al. The Impact of Timeliness for Hardware-based Prefetching from Main Memory , 1997 .
[18] Michael L. Scott,et al. Simple but effective techniques for NUMA memory management , 1989, SOSP '89.
[19] Steven G. Johnson,et al. The Design and Implementation of FFTW3 , 2005, Proceedings of the IEEE.
[20] Mahmut T. Kandemir,et al. Synergistic TLBs for High Performance Address Translation in Chip Multiprocessors , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.
[21] Margaret Martonosi,et al. TLB Improvements for Chip Multiprocessors: Inter-Core Cooperative Prefetchers and Shared Last-Level TLBs , 2013, TACO.
[22] Hsien-Hsin S. Lee,et al. 3D-MAPS: 3D Massively parallel processor with stacked memory , 2012, 2012 IEEE International Solid-State Circuits Conference.
[23] Paul Hudak,et al. Memory coherence in shared virtual memory systems , 1989, TOCS.
[24] John S. Liptay,et al. Structural Aspects of the System/360 Model 85 II: The Cache , 1968, IBM Syst. J..
[25] James R. Larus,et al. Fine-grain access control for distributed shared memory , 1994, ASPLOS VI.
[26] Willy Zwaenepoel,et al. Munin: distributed shared memory based on type-specific memory coherence , 1990, PPOPP '90.
[27] Mark D. Hill,et al. Supporting Very Large DRAM Caches with Compound-Access Scheduling and MissMap , 2012, IEEE Micro.
[28] Thomas R. Gross,et al. Memory system performance in a NUMA multicore multiprocessor , 2011, SYSTOR '11.
[29] Anoop Gupta,et al. Operating system support for improving data locality on CC-NUMA compute servers , 1996, ASPLOS VII.
[30] Tim Brecht,et al. On the importance of parallel application placement in NUMA multiprocessors , 1993 .
[31] Yan Solihin,et al. CHOP: Adaptive filter-based DRAM caching for CMP server platforms , 2010, HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture.
[32] Gabriel H. Loh,et al. Fundamental Latency Trade-off in Architecting DRAM Caches: Outperforming Impractical SRAM-Tags with a Simple and Practical Design , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.
[33] Yangdong Deng,et al. Interconnect characteristics of 2.5-D system integration scheme , 2001, ISPD '01.
[34] Jacob Nelson,et al. Latency-Tolerant Software Distributed Shared Memory , 2015, USENIX Annual Technical Conference.
[35] James R. Larus,et al. Tempest and typhoon: user-level shared memory , 1994, ISCA '94.
[36] Natalie D. Enright Jerger,et al. A dual grain hit-miss detector for large Die-Stacked DRAM caches , 2013, 2013 Design, Automation & Test in Europe Conference & Exhibition (DATE).
[37] J. ContiC.,et al. Structural aspects of the system/360 model 85 , 1968 .
[38] David L. Black,et al. Translation lookaside buffer consistency: a software approach , 1989, ASPLOS III.
[39] Anoop Gupta,et al. Comparative evaluation of latency reducing and tolerating techniques , 1991, ISCA '91.
[40] Ricardo Bianchini,et al. Page placement in hybrid memory systems , 2011, ICS '11.
[41] David Blaauw,et al. Centip3De: A 3930DMIPS/W configurable near-threshold 3D stacked system with 64 ARM Cortex-M3 cores , 2012, 2012 IEEE International Solid-State Circuits Conference.
[42] Avi Mendelson,et al. DiDi: Mitigating the Performance Impact of TLB Shootdowns Using a Shared TLB Directory , 2011, 2011 International Conference on Parallel Architectures and Compilation Techniques.
[43] Alexandra Fedorova,et al. A case for NUMA-aware contention management on multicore systems , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).
[44] Trevor N. Mudge,et al. A look at several memory management units, TLB-refill mechanisms, and page table organizations , 1998, ASPLOS VIII.
[45] Vivien Quéma,et al. Traffic management: a holistic approach to memory placement on NUMA systems , 2013, ASPLOS '13.
[46] Willy Zwaenepoel,et al. Implementation and performance of Munin , 1991, SOSP '91.