Paper information - Architecting high-performance, efficient, and scalable heterogeneous memory systems with 3D-DRAM

[1] Mark A. Holliday,et al. Reference history, page size, and migration daemons in local/remote architectures , 1989, ASPLOS III.

[2] Alberto Ros,et al. PS directory: a scalable multilevel directory cache for CMPs , 2014, The Journal of Supercomputing.

[3] Yen-Chen Liu,et al. Knights Landing: Second-Generation Intel Xeon Phi Product , 2016, IEEE Micro.

[4] Jeffrey B. Rothman,et al. Sector cache design and performance , 2000, Proceedings 8th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems.

[5] Josep Torrellas,et al. Cache-Only Memory Architectures , 1999, Computer.

[6] Amitabha Roy,et al. ALLARM: Optimizing sparse directories for thread-local data , 2014, 2014 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[7] Anoop Gupta,et al. The directory-based cache coherence protocol for the DASH multiprocessor , 1990, ISCA '90.

[8] David A. Wood,et al. A Primer on Memory Consistency and Cache Coherence , 2012, Synthesis Lectures on Computer Architecture.

[9] Anoop Gupta,et al. The SPLASH-2 programs: characterization and methodological considerations , 1995, ISCA.

[10] William J. Dally,et al. Scatter-add in data parallel architectures , 2005, 11th International Symposium on High-Performance Computer Architecture.

[11] Carole-Jean Wu,et al. SHiP: Signature-based Hit Predictor for high performance caching , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[12] Lieven Eeckhout,et al. Power-aware multi-core simulation for early design stage hardware/software co-optimization , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).

[13] William J. Dally,et al. Memory access scheduling , 2000, Proceedings of 27th International Symposium on Computer Architecture.

[14] Darryl Gove,et al. CPU2006 working set size , 2007, CARN.

[15] Mikko H. Lipasti,et al. Tag tables , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[16] Mark D. Hill,et al. Efficiently enabling conventional block sizes for very large die-stacked DRAM caches , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[17] D.A. Wood,et al. Reactive NUMA: A Design For Unifying S-COMA And CC-NUMA , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[18] Philip Machanick,et al. Hardware-software trade-offs in a direct Rambus implementation of the RAMpage memory hierarchy , 1998, ASPLOS VIII.

[19] So-Ra Kim,et al. 8Gb 3D DDR3 DRAM using through-silicon-via technology , 2009, 2009 IEEE International Solid-State Circuits Conference - Digest of Technical Papers.

[20] Luiz André Barroso,et al. Piranha: a scalable architecture based on single-chip multiprocessing , 2000, Proceedings of 27th International Symposium on Computer Architecture.

[21] Shih-Hung Chen,et al. Phase-change random access memory: A scalable technology , 2008, IBM J. Res. Dev.

[22] Sally A. McKee,et al. Hitting the memory wall: implications of the obvious , 1995, CARN.

[23] Christoforos E. Kozyrakis,et al. SCD: A scalable coherence directory with flexible sharer set encoding , 2012, IEEE International Symposium on High-Performance Computer Architecture.

[24] Brian Rogers,et al. Scaling the bandwidth wall: challenges in and avenues for CMP scaling , 2009, ISCA '09.

[25] Yale N. Patt,et al. Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[26] Samira Manabi Khan,et al. Sampling Dead Block Prediction for Last-Level Caches , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[27] Anoop Gupta,et al. Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes , 1990, ICPP.

[28] David H. Bailey,et al. The NAS parallel benchmarks summary and preliminary results , 1991, Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (Supercomputing '91).

[29] Aamer Jaleel,et al. BEAR: Techniques for mitigating bandwidth bloat in gigascale DRAM caches , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[30] Gabriel H. Loh,et al. A Mostly-Clean DRAM Cache for Effective Hit Speculation and Self-Balancing Dispatch , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[31] Philip G. Emma,et al. Cache miss behavior: is it sqrt(2)? , 2006.

[32] Erich Strohmaier,et al. Quantifying Locality In The Memory Access Patterns of HPC Applications , 2005, ACM/IEEE SC 2005 Conference (SC'05).

[33] Aamer Jaleel,et al. Achieving Non-Inclusive Cache Performance with Inclusive Caches: Temporal Locality Aware (TLA) Cache Management Policies , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[34] Mark Horowitz,et al. An evaluation of directory schemes for cache coherence , 1998, ISCA '98.

[35] Aamer Jaleel,et al. High performance cache replacement using re-reference interval prediction (RRIP) , 2010, ISCA.

[36] Harish Patil,et al. Pin: building customized program analysis tools with dynamic instrumentation , 2005, PLDI '05.

[37] Stephen W. Keckler,et al. Page Placement Strategies for GPUs within Heterogeneous Memory Systems , 2015, ASPLOS.

[38] Henry G. Dietz,et al. Improving cache performance by selective cache bypass , 1989, Proceedings of the Twenty-Second Annual Hawaii International Conference on System Sciences. Volume 1: Architecture Track.

[39] Onur Mutlu,et al. Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[40] Andreas Moshovos. RegionScout: exploiting coarse grain sharing in snoop-based coherence , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[41] Babak Falsafi,et al. Die-stacked DRAM caches for servers: hit ratio, latency, or bandwidth? have it all with footprint cache , 2013, ISCA.

[42] José González,et al. A two-level directory architecture for highly scalable cc-NUMA multiprocessors , 2005, IEEE Transactions on Parallel and Distributed Systems.

[43] Gabriel H. Loh,et al. Fundamental Latency Trade-off in Architecting DRAM Caches: Outperforming Impractical SRAM-Tags with a Simple and Practical Design , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[44] Zhen Fang,et al. Highly efficient synchronization based on active memory operations , 2004, 18th International Parallel and Distributed Processing Symposium.

[45] Mor Harchol-Balter,et al. Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[46] Lizy Kurian John,et al. Minimalist open-page: A DRAM page-mode scheduling policy for the many-core era , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[47] R. Manikantan,et al. Bi-Modal DRAM Cache: A Scalable and Effective Die-Stacked DRAM Cache , 2014, MICRO 2014.

[48] Gabriel H. Loh,et al. Challenges in Heterogeneous Die-Stacked and Off-Chip Memory Systems , 2012.

[49] Katsuyuki Sakuma,et al. Three-dimensional silicon integration , 2008, IBM J. Res. Dev.

[50] Peter J. Denning,et al. Properties of the working-set model , 1972, CACM.

[51] James R. Goodman,et al. Memory Bandwidth Limitations of Future Microprocessors , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).

[52] Aamer Jaleel,et al. Adaptive insertion policies for high performance caching , 2007, ISCA '07.

[53] Babak Falsafi,et al. Dead-block prediction & dead-block correlating prefetchers , 2001, ISCA 2001.

[54] Benjamin C. Lee,et al. REF: resource elasticity fairness with sharing incentives for multiprocessors , 2014, ASPLOS.

[55] Daniel A. Jiménez. Insertion and promotion for tree-based PseudoLRU last-level caches , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[56] Brad Calder,et al. Using SimPoint for accurate and efficient simulation , 2003, SIGMETRICS '03.

[57] Janak H. Patel,et al. A low-overhead coherence solution for multiprocessors with private cache memories , 1984, ISCA '84.

[58] Matthias A. Blumrich,et al. Design and implementation of the blue gene/P snoop filter , 2008, 2008 IEEE 14th International Symposium on High Performance Computer Architecture.

[59] Kai Li,et al. The PARSEC benchmark suite: Characterization and architectural implications , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[60] John L. Henning. SPEC CPU2006 memory footprint , 2007, CARN.

[61] Milo M. K. Martin,et al. Why on-chip cache coherence is here to stay , 2012, Commun. ACM.

[62] Thomas Vogelsang,et al. Understanding the Energy Consumption of Dynamic Random Access Memories , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[63] Kevin Tran,et al. The era of high bandwidth memory , 2016, 2016 IEEE Hot Chips 28 Symposium (HCS).

[64] Sangyeun Cho,et al. Stash directory: A scalable directory for many-core coherence , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).

[65] Mikko H. Lipasti,et al. Improving multiprocessor performance with coarse-grain coherence tracking , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[66] Antonio Robles,et al. Increasing the effectiveness of directory caches by deactivating coherence for private memory blocks , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[67] Daniel Sánchez,et al. Exploiting commutativity to reduce the cost of updates to shared data in cache-coherent systems , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[68] John L. Henning. SPEC CPU2006 benchmark descriptions , 2006, CARN.

[69] Wen-mei W. Hwu,et al. Run-time spatial locality detection and optimization , 1997, Proceedings of 30th Annual International Symposium on Microarchitecture.

[70] Kevin M. Lepak,et al. Cache Hierarchy and Memory Subsystem of the AMD Opteron Processor , 2010, IEEE Micro.

[71] Kaushik Roy,et al. Reducing set-associative cache energy via way-prediction and selective direct-mapping , 2001, Proceedings. 34th ACM/IEEE International Symposium on Microarchitecture. MICRO-34.

[72] Stijn Eyerman,et al. An Evaluation of High-Level Mechanistic Core Models , 2014, ACM Trans. Archit. Code Optim.

[73] Babak Falsafi,et al. Cuckoo directory: A scalable directory for many-core systems , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.

[74] Thomas F. Wenisch,et al. Spatial Memory Streaming , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).

[75] Euiseong Seo,et al. Empirical Analysis on Energy Efficiency of Flash-based SSDs , 2008, HotPower.

[76] Vijayalakshmi Srinivasan,et al. A Tagless Coherence Directory , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[77] Alaa R. Alameldeen,et al. Transparent Hardware Management of Stacked DRAM as Part of Memory , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.

[78] Anoop Gupta,et al. Comparative Performance Evaluation of Cache-Coherent NUMA and COMA Architectures , 1992, Proceedings of the 19th Annual International Symposium on Computer Architecture.

[79] Babak Falsafi,et al. Unison Cache: A Scalable and Effective Die-Stacked DRAM Cache , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.

[80] Henry Hoffmann,et al. Remote Store Programming , 2010, HiPEAC.

[81] David Roberts,et al. Heterogeneous memory architectures: A HW/SW approach for mixing die-stacked and off-package memories , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[82] Avinash Sodani,et al. Knights landing (KNL): 2nd Generation Intel® Xeon Phi processor , 2015, 2015 IEEE Hot Chips 27 Symposium (HCS).