Architecting high-performance, efficient, and scalable heterogeneous memory systems with 3D-DRAM

[1]  Mark A. Holliday,et al.  Reference history, page size, and migration daemons in local/remote architectures , 1989, ASPLOS III.

[2]  Alberto Ros,et al.  PS directory: a scalable multilevel directory cache for CMPs , 2014, The Journal of Supercomputing.

[3]  Yen-Chen Liu,et al.  Knights Landing: Second-Generation Intel Xeon Phi Product , 2016, IEEE Micro.

[4]  Jeffrey B. Rothman,et al.  Sector cache design and performance , 2000, Proceedings 8th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems.

[5]  Josep Torrellas,et al.  Cache-Only Memory Architectures , 1999, Computer.

[6]  Amitabha Roy,et al.  ALLARM: Optimizing sparse directories for thread-local data , 2014, 2014 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[7]  Anoop Gupta,et al.  The directory-based cache coherence protocol for the DASH multiprocessor , 1990, ISCA '90.

[8]  David A. Wood,et al.  A Primer on Memory Consistency and Cache Coherence , 2012, Synthesis Lectures on Computer Architecture.

[9]  Anoop Gupta,et al.  The SPLASH-2 programs: characterization and methodological considerations , 1995, ISCA.

[10]  William J. Dally,et al.  Scatter-add in data parallel architectures , 2005, 11th International Symposium on High-Performance Computer Architecture.

[11]  Carole-Jean Wu,et al.  SHiP: Signature-based Hit Predictor for high performance caching , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[12]  Lieven Eeckhout,et al.  Power-aware multi-core simulation for early design stage hardware/software co-optimization , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).

[13]  William J. Dally,et al.  Memory access scheduling , 2000, Proceedings of 27th International Symposium on Computer Architecture.

[14]  Darryl Gove,et al.  CPU2006 working set size , 2007, CARN.

[15]  Mikko H. Lipasti,et al.  Tag tables , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[16]  Mark D. Hill,et al.  Efficiently enabling conventional block sizes for very large die-stacked DRAM caches , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[17]  D.A. Wood,et al.  Reactive NUMA: A Design For Unifying S-COMA And CC-NUMA , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[18]  Philip Machanick,et al.  Hardware-software trade-offs in a direct Rambus implementation of the RAMpage memory hierarchy , 1998, ASPLOS VIII.

[19]  So-Ra Kim,et al.  8Gb 3D DDR3 DRAM using through-silicon-via technology , 2009, 2009 IEEE International Solid-State Circuits Conference - Digest of Technical Papers.

[20]  Luiz André Barroso,et al.  Piranha: a scalable architecture based on single-chip multiprocessing , 2000, Proceedings of 27th International Symposium on Computer Architecture.

[21]  Shih-Hung Chen,et al.  Phase-change random access memory: A scalable technology , 2008, IBM J. Res. Dev.

[22]  Sally A. McKee,et al.  Hitting the memory wall: implications of the obvious , 1995, CARN.

[23]  Christoforos E. Kozyrakis,et al.  SCD: A scalable coherence directory with flexible sharer set encoding , 2012, IEEE International Symposium on High-Performance Computer Architecture.

[24]  Brian Rogers,et al.  Scaling the bandwidth wall: challenges in and avenues for CMP scaling , 2009, ISCA '09.

[25]  Yale N. Patt,et al.  Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[26]  Samira Manabi Khan,et al.  Sampling Dead Block Prediction for Last-Level Caches , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[27]  Anoop Gupta,et al.  Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes , 1990, ICPP.

[28]  David H. Bailey,et al.  The NAS parallel benchmarks summary and preliminary results , 1991, Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (Supercomputing '91).

[29]  Aamer Jaleel,et al.  BEAR: Techniques for mitigating bandwidth bloat in gigascale DRAM caches , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[30]  Gabriel H. Loh,et al.  A Mostly-Clean DRAM Cache for Effective Hit Speculation and Self-Balancing Dispatch , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[31]  Philip G. Emma,et al.  Cache miss behavior: is it sqrt(2)? , 2006.

[32]  Erich Strohmaier,et al.  Quantifying Locality In The Memory Access Patterns of HPC Applications , 2005, ACM/IEEE SC 2005 Conference (SC'05).

[33]  Aamer Jaleel,et al.  Achieving Non-Inclusive Cache Performance with Inclusive Caches: Temporal Locality Aware (TLA) Cache Management Policies , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[34]  Mark Horowitz,et al.  An evaluation of directory schemes for cache coherence , 1998, ISCA '98.

[35]  Aamer Jaleel,et al.  High performance cache replacement using re-reference interval prediction (RRIP) , 2010, ISCA.

[36]  Harish Patil,et al.  Pin: building customized program analysis tools with dynamic instrumentation , 2005, PLDI '05.

[37]  Stephen W. Keckler,et al.  Page Placement Strategies for GPUs within Heterogeneous Memory Systems , 2015, ASPLOS.

[38]  Henry G. Dietz,et al.  Improving cache performance by selective cache bypass , 1989, Proceedings of the Twenty-Second Annual Hawaii International Conference on System Sciences. Volume 1: Architecture Track.

[39]  Onur Mutlu,et al.  Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[40]  Andreas Moshovos.  RegionScout: exploiting coarse grain sharing in snoop-based coherence , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[41]  Babak Falsafi,et al.  Die-stacked DRAM caches for servers: hit ratio, latency, or bandwidth? have it all with footprint cache , 2013, ISCA.

[42]  José González,et al.  A two-level directory architecture for highly scalable cc-NUMA multiprocessors , 2005, IEEE Transactions on Parallel and Distributed Systems.

[43]  Gabriel H. Loh,et al.  Fundamental Latency Trade-off in Architecting DRAM Caches: Outperforming Impractical SRAM-Tags with a Simple and Practical Design , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[44]  Zhen Fang,et al.  Highly efficient synchronization based on active memory operations , 2004, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings.

[45]  Mor Harchol-Balter,et al.  Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[46]  Lizy Kurian John,et al.  Minimalist open-page: A DRAM page-mode scheduling policy for the many-core era , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[47]  R. Manikantan,et al.  Bi-Modal DRAM Cache: A Scalable and Effective Die-Stacked DRAM Cache , 2014, MICRO 2014.

[48]  Gabriel H. Loh,et al.  Challenges in Heterogeneous Die-Stacked and Off-Chip Memory Systems , 2012.

[49]  Katsuyuki Sakuma,et al.  Three-dimensional silicon integration , 2008, IBM J. Res. Dev.

[50]  Peter J. Denning,et al.  Properties of the working-set model , 1972, CACM.

[51]  James R. Goodman,et al.  Memory Bandwidth Limitations of Future Microprocessors , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).

[52]  Aamer Jaleel,et al.  Adaptive insertion policies for high performance caching , 2007, ISCA '07.

[53]  Babak Falsafi,et al.  Dead-block prediction & dead-block correlating prefetchers , 2001, ISCA 2001.

[54]  Benjamin C. Lee,et al.  REF: resource elasticity fairness with sharing incentives for multiprocessors , 2014, ASPLOS.

[55]  Daniel A. Jiménez.  Insertion and promotion for tree-based PseudoLRU last-level caches , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[56]  Brad Calder,et al.  Using SimPoint for accurate and efficient simulation , 2003, SIGMETRICS '03.

[57]  Janak H. Patel,et al.  A low-overhead coherence solution for multiprocessors with private cache memories , 1984, ISCA '84.

[58]  Matthias A. Blumrich,et al.  Design and implementation of the blue gene/P snoop filter , 2008, 2008 IEEE 14th International Symposium on High Performance Computer Architecture.

[59]  Kai Li,et al.  The PARSEC benchmark suite: Characterization and architectural implications , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[60]  John L. Henning.  SPEC CPU2006 memory footprint , 2007, CARN.

[61]  Milo M. K. Martin,et al.  Why on-chip cache coherence is here to stay , 2012, Commun. ACM.

[62]  Thomas Vogelsang,et al.  Understanding the Energy Consumption of Dynamic Random Access Memories , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[63]  Kevin Tran,et al.  The era of high bandwidth memory , 2016, 2016 IEEE Hot Chips 28 Symposium (HCS).

[64]  Sangyeun Cho,et al.  Stash directory: A scalable directory for many-core coherence , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).

[65]  Mikko H. Lipasti,et al.  Improving multiprocessor performance with coarse-grain coherence tracking , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[66]  Antonio Robles,et al.  Increasing the effectiveness of directory caches by deactivating coherence for private memory blocks , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[67]  Daniel Sánchez,et al.  Exploiting commutativity to reduce the cost of updates to shared data in cache-coherent systems , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[68]  John L. Henning.  SPEC CPU2006 benchmark descriptions , 2006, CARN.

[69]  Wen-mei W. Hwu,et al.  Run-time spatial locality detection and optimization , 1997, Proceedings of 30th Annual International Symposium on Microarchitecture.

[70]  Kevin M. Lepak,et al.  Cache Hierarchy and Memory Subsystem of the AMD Opteron Processor , 2010, IEEE Micro.

[71]  Kaushik Roy,et al.  Reducing set-associative cache energy via way-prediction and selective direct-mapping , 2001, Proceedings. 34th ACM/IEEE International Symposium on Microarchitecture. MICRO-34.

[72]  Stijn Eyerman,et al.  An Evaluation of High-Level Mechanistic Core Models , 2014, ACM Trans. Archit. Code Optim..

[73]  Babak Falsafi,et al.  Cuckoo directory: A scalable directory for many-core systems , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.

[74]  Thomas F. Wenisch,et al.  Spatial Memory Streaming , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).

[75]  Euiseong Seo,et al.  Empirical Analysis on Energy Efficiency of Flash-based SSDs , 2008, HotPower.

[76]  Vijayalakshmi Srinivasan,et al.  A Tagless Coherence Directory , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[77]  Alaa R. Alameldeen,et al.  Transparent Hardware Management of Stacked DRAM as Part of Memory , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.

[78]  Anoop Gupta,et al.  Comparative Performance Evaluation of Cache-Coherent NUMA and COMA Architectures , 1992, Proceedings of the 19th Annual International Symposium on Computer Architecture.

[79]  Babak Falsafi,et al.  Unison Cache: A Scalable and Effective Die-Stacked DRAM Cache , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.

[80]  Henry Hoffmann,et al.  Remote Store Programming , 2010, HiPEAC.

[81]  David Roberts,et al.  Heterogeneous memory architectures: A HW/SW approach for mixing die-stacked and off-package memories , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[82]  Avinash Sodani,et al.  Knights landing (KNL): 2nd Generation Intel® Xeon Phi processor , 2015, 2015 IEEE Hot Chips 27 Symposium (HCS).