Reducing DRAM Latency at Low Cost by Exploiting Heterogeneity

In modern systems, DRAM-based main memory is signi?cantly slower than the processor.Consequently, processors spend a long time waiting to access data from main memory, makingthe long main memory access latency one of the most critical bottlenecks to achieving highsystem performance. Unfortunately, the latency of DRAM has remained almost constant inthe past decade. This is mainly because DRAM has been optimized for cost-per-bit, ratherthan access latency. As a result, DRAM latency is not reducing with technology scaling, andcontinues to be an important performance bottleneck in modern and future systems.This dissertation seeks to achieve low latency DRAM-based memory systems at low costin three major directions. The key idea of these three major directions is to enable and ex-ploit latency heterogeneity in DRAM architecture. First, based on the observation that longbitlines in DRAM are one of the dominant sources of DRAM latency, we propose a newDRAM architecture, Tiered-Latency DRAM (TL-DRAM), which divides the long bitline intotwo shorter segments using an isolation transistor, allowing one segment to be accessed withreduced latency. Second, we propose a ?ne-grained DRAM latency reduction mechanism,Adaptive-Latency DRAM, which optimizes DRAM latency for the common operating conditions for individual DRAM module. We observe that DRAM manufacturers incorporate a very large timing margin as a provision against the worst-case operating conditions, whichis accessing the slowest cell across all DRAM products with the worst latency at the highesttemperature, even though such a slowest cell and such an operating condition are rare. Ourmechanism dynamically optimizes DRAM latency to the current operating condition of theaccessed DRAM module, thereby reliably improving system performance. Third, we observethat cells closer to the peripheral logic can be much faster than cells farther from the peripherallogic (a phenomenon we call architectural variation). Based on this observation, we propose anew technique, Architectural-Variation-Aware DRAM (AVA-DRAM), which reduces DRAMlatency at low cost, by pro?ling and identifying only the inherently slower regions in DRAMto dynamically determine the lowest latency DRAM can operate at without causing failures.This dissertation provides a detailed analysis of DRAM latency by using both circuit-levelsimulation with a detailed DRAM model and FPGA-based pro?ling of real DRAM modules.Our latency analysis shows that our low latency DRAM mechanisms enable significant latencyreductions, leading to large improvement in both system performance and energy e?fficiencyacross a variety of workloads in our evaluated systems, while ensuring reliable DRAM operation.

[1]  Onur Mutlu,et al.  Improving cache performance using read-write partitioning , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).

[2]  T. Hamamoto,et al.  On the retention time distribution of dynamic random access memory (DRAM) , 1998 .

[3]  Onur Mutlu,et al.  ERRoR ANAlysIs AND RETENTIoN-AwARE ERRoR MANAgEMENT FoR NAND FlAsh MEMoRy , 2013 .

[4]  Brent Keeth,et al.  DRAM Circuit Design: A Tutorial , 2000 .

[5]  Kiyoung Choi,et al.  A scalable processing-in-memory accelerator for parallel graph processing , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[6]  D. Yaney,et al.  A meta-stable leakage phenomenon in DRAM charge storage —Variable hold time , 1987, 1987 International Electron Devices Meeting.

[7]  Kieran McLaughlin,et al.  An RLDRAM II Implementation of a 10Gbps Shared Packet Buffer for Network Processing , 2007, Second NASA/ESA Conference on Adaptive Hardware and Systems (AHS 2007).

[8]  Xin Li,et al.  A Memory Soft Error Measurement on Production Systems , 2007, USENIX Annual Technical Conference.

[9]  Onur Mutlu,et al.  Address-value delta (AVD) prediction: increasing the effectiveness of runahead execution by exploiting regular memory allocation patterns , 2005, 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'05).

[10]  Vijayalakshmi Srinivasan,et al.  Enhancing lifetime and security of PCM-based Main Memory with Start-Gap Wear Leveling , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[11]  Jae-Kyung Wee,et al.  A post-package bit-repair scheme using static latches with bipolar-voltage programmable antifuse circuit for high-density DRAMs , 2002 .

[12]  Bong-Seok Han,et al.  Adaptive Self Refresh Scheme for Battery Operated High-Density Mobile DRAM Applications , 2006, 2006 IEEE Asian Solid-State Circuits Conference.

[13]  Norman P. Jouppi,et al.  Single-ISA heterogeneous multi-core architectures: the potential for processor power reduction , 2003, Proceedings. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. MICRO-36..

[14]  Shimeng Yu,et al.  Metal–Oxide RRAM , 2012, Proceedings of the IEEE.

[15]  Onur Mutlu,et al.  Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems , 2008, 2008 International Symposium on Computer Architecture.

[16]  Zhao Zhang,et al.  Software thermal management of dram memory for multicore systems , 2008, SIGMETRICS '08.

[17]  Onur Mutlu,et al.  Prefetch-aware shared-resource management for multi-core systems , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[18]  Gennady Pekhimenko,et al.  Toggle-Aware Bandwidth Compression for GPUs , 2015 .

[19]  Chita R. Das,et al.  A heterogeneous multiple network-on-chip design: An application-aware approach , 2013, 2013 50th ACM/EDAC/IEEE Design Automation Conference (DAC).

[20]  Rachata Ausavarungnirun,et al.  Row buffer locality aware caching policies for hybrid memories , 2012, 2012 IEEE 30th International Conference on Computer Design (ICCD).

[21]  Trevor Mudge,et al.  Improving data cache performance by pre-executing instructions under a cache miss , 1997 .

[22]  A. Snavely,et al.  Symbiotic jobscheduling for a simultaneous mutlithreading processor , 2000, SIGP.

[23]  Mahmut T. Kandemir,et al.  Managing GPU Concurrency in Heterogeneous Architectures , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.

[24]  Y. Mori,et al.  The origin of variable retention time in DRAM , 2005, IEEE InternationalElectron Devices Meeting, 2005. IEDM Technical Digest..

[25]  David A. Wood,et al.  Adaptive cache compression for high-performance processors , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[26]  O Seongil,et al.  Reducing memory access latency with asymmetric DRAM bank organizations , 2013, ISCA.

[27]  John Paul Shen,et al.  Mitigating Amdahl's law through EPI throttling , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[28]  Aamer Jaleel,et al.  High performance cache replacement using re-reference interval prediction (RRIP) , 2010, ISCA.

[29]  Anna R. Karlin,et al.  A study of integrated prefetching and caching strategies , 1995, SIGMETRICS '95/PERFORMANCE '95.

[30]  Carlos Alvarez Martinez,et al.  Dynamic Tolerance Region Computing for Multimedia , 2012, IEEE Transactions on Computers.

[31]  Jian Huang,et al.  Exploiting basic block value locality with block reuse , 1999, Proceedings Fifth International Symposium on High-Performance Computer Architecture.

[32]  Fred Douglis,et al.  The Compression Cache: Using On-line Compression to Extend Physical Memory , 1993, USENIX Winter.

[33]  Onur Mutlu,et al.  ChargeCache: Reducing DRAM latency by exploiting row access locality , 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[34]  Onur Mutlu,et al.  Low-Cost Inter-Linked Subarrays (LISA): Enabling fast inter-subarray data movement in DRAM , 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[35]  James E. Smith,et al.  The predictability of data values , 1997, Proceedings of 30th Annual International Symposium on Microarchitecture.

[36]  Feng Lin,et al.  DRAM circuit design , 2000 .

[37]  Young-Hyun Jun,et al.  1.2V 1.6Gb/s 56nm 6F2 4Gb DDR3 SDRAM with hybrid-I/O sense amplifier and segmented sub-array architecture , 2009, 2009 IEEE International Solid-State Circuits Conference - Digest of Technical Papers.

[38]  Youyou Lu,et al.  Loose-Ordering Consistency for persistent memory , 2014, 2014 IEEE 32nd International Conference on Computer Design (ICCD).

[39]  Yale N. Patt,et al.  Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[40]  Rajeev Balasubramonian,et al.  Interconnect design considerations for large NUCA caches , 2007, ISCA '07.

[41]  Kinam Kim,et al.  A New Investigation of Data Retention Time in Truly Nanoscaled DRAMs , 2009, IEEE Electron Device Letters.

[42]  Kevin Kai-Wei Chang,et al.  Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[43]  Peter M. Kogge,et al.  EXECUBE-A New Architecture for Scaleable MPPs , 1994, 1994 International Conference on Parallel Processing Vol. 1.

[44]  Osman S. Unsal,et al.  Flash correct-and-refresh: Retention-aware error management for increased flash memory lifetime , 2012, 2012 IEEE 30th International Conference on Computer Design (ICCD).

[45]  Onur Mutlu,et al.  A case for exploiting subarray-level parallelism (SALP) in DRAM , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[46]  Onur Mutlu,et al.  DRAM-Aware Last-Level Cache Writeback: Reducing Write-Caused Interference in Memory Systems , 2010 .

[47]  Wen-mei W. Hwu,et al.  Run-Time Cache Bypassing , 1999, IEEE Trans. Computers.

[48]  Jae-Kyung Wee,et al.  An antifuse EPROM circuitry scheme for field-programmable repair in DRAM , 2000, IEEE Journal of Solid-State Circuits.

[49]  Chita R. Das,et al.  Application-aware prioritization mechanisms for on-chip networks , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[50]  Jun Yang,et al.  Phase-Change Technology and the Future of Main Memory , 2010, IEEE Micro.

[51]  Shih-Hung Chen,et al.  Phase-change random access memory: A scalable technology , 2008, IBM J. Res. Dev..

[52]  Zhao Zhang,et al.  Cached DRAM for ILP Processor Memory Access Latency Reduction , 2001, IEEE Micro.

[53]  Norbert Wehn,et al.  Embedded DRAM Development: Technology, Physical Design, and Application Issues , 2001, IEEE Des. Test Comput..

[54]  Mikko H. Lipasti,et al.  Value locality and load value prediction , 1996, ASPLOS VII.

[55]  Chia-Lin Yang,et al.  Improving DRAM latency with dynamic asymmetric subarray , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[56]  Onur Mutlu,et al.  A Case for MLP-Aware Cache Replacement , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).

[57]  Yang Xiao,et al.  Low power memristor-based ReRAM design with Error Correcting Code , 2012, 17th Asia and South Pacific Design Automation Conference.

[58]  Onur Mutlu,et al.  MISE: Providing performance predictability and improving fairness in shared main memory systems , 2013, 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA).

[59]  Carlos Alvarez,et al.  On the potential of tolerant region reuse for multimedia applications , 2001, ICS '01.

[60]  Gabriel H. Loh,et al.  3D-Stacked Memory Architectures for Multi-core Processors , 2008, 2008 International Symposium on Computer Architecture.

[61]  Jongmoo Choi,et al.  WARM: Improving NAND flash memory lifetime with write-hotness aware retention management , 2015, 2015 31st Symposium on Mass Storage Systems and Technologies (MSST).

[62]  Bianca Schroeder,et al.  Cosmic rays don't strike twice: understanding the nature of DRAM errors and the implications for system design , 2012, ASPLOS XVII.

[63]  G.S. Sohi,et al.  Dynamic instruction reuse , 1997, ISCA '97.

[64]  R.-L. Jiang,et al.  Guardband determination for the detection of off-state and junction leakages in DRAM testing , 2001, Proceedings 10th Asian Test Symposium.

[65]  James E. Smith,et al.  Performance Of Cached Dram Organizations In Vector Supercomputers , 1993, Proceedings of the 20th Annual International Symposium on Computer Architecture.

[66]  Onur Mutlu,et al.  The application slowdown model: Quantifying and controlling the impact of inter-application interference at shared caches and main memory , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[67]  Onur Mutlu,et al.  Improving DRAM performance by parallelizing refreshes with accesses , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).

[68]  Christoforos E. Kozyrakis,et al.  A case for intelligent RAM , 1997, IEEE Micro.

[69]  Onur Mutlu,et al.  Improving memory Bank-Level Parallelism in the presence of prefetching , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[70]  Yannis Smaragdakis,et al.  The Case for Compressed Caching in Virtual Memory Systems , 1999, USENIX Annual Technical Conference, General Track.

[71]  Jong-Ho Kang,et al.  A 1.2V 23nm 6F2 4Gb DDR3 SDRAM with local-bitline sense amplifier, hybrid LIO sense amplifier and dummy-less array architecture , 2012, 2012 IEEE International Solid-State Circuits Conference.

[72]  Tajana Simunic,et al.  PDRAM: A hybrid PRAM and DRAM main memory system , 2009, 2009 46th ACM/IEEE Design Automation Conference.

[73]  Ricardo Bianchini,et al.  Page placement in hybrid memory systems , 2011, ICS '11.

[74]  Onur Mutlu,et al.  The Blacklisting Memory Scheduler: Achieving high performance and fairness at low cost , 2014, 2014 IEEE 32nd International Conference on Computer Design (ICCD).

[75]  Onur Mutlu,et al.  The efficacy of error mitigation techniques for DRAM retention failures: a comparative experimental study , 2014, SIGMETRICS '14.

[76]  Tze Meng Low,et al.  3 D-Stacked Memory-Side Acceleration : Accelerator and System Design , 2014 .

[77]  Wongyu Shin,et al.  Multiple Clone Row DRAM: A low latency and area optimized DRAM , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[78]  Onur Mutlu,et al.  Rollback-free value prediction with approximate loads , 2014, 2014 23rd International Conference on Parallel Architecture and Compilation (PACT).

[79]  Richard Veras,et al.  RAIDR: Retention-aware intelligent DRAM refresh , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[80]  Frederick A. Ware,et al.  Improving Power and Data Efficiency with Threaded Memory Modules , 2006, 2006 International Conference on Computer Design.

[81]  Chita R. Das,et al.  Aérgia: exploiting packet latency slack in on-chip networks , 2010, ISCA.

[82]  Bruce R. Childers,et al.  COMeT: Continuous Online Memory Test , 2011, 2011 IEEE 17th Pacific Rim International Symposium on Dependable Computing.

[83]  Onur Mutlu,et al.  A QoS-Enabled On-Die Interconnect Fabric for Kilo-Node Chips , 2012, IEEE Micro.

[84]  Onur Mutlu,et al.  Program interference in MLC NAND flash memory: Characterization, modeling, and mitigation , 2013, ICCD.

[85]  Rachata Ausavarungnirun,et al.  RowClone: Fast and energy-efficient in-DRAM bulk data copy and initialization , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[86]  Kiyoung Choi,et al.  PIM-enabled instructions: A low-overhead, locality-aware processing-in-memory architecture , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[87]  Mahmut T. Kandemir,et al.  Evaluating STT-RAM as an energy-efficient main memory alternative , 2013, 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[88]  Costas J. Spanos,et al.  Modeling within-die spatial correlation effects for process-design co-optimization , 2005, Sixth international symposium on quality electronic design (isqed'05).

[89]  Onur Mutlu,et al.  The Blacklisting Memory Scheduler: Balancing Performance, Fairness and Complexity , 2015, ArXiv.

[90]  Sai Prashanth Muralidhara,et al.  Reducing memory interference in multicore systems via application-aware memory channel partitioning , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[91]  Wen-mei W. Hwu,et al.  Compiler-directed dynamic computation reuse: rationale and initial results , 1999, MICRO-32. Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture.

[92]  Sherif M. Sharroush,et al.  Dynamic random-access memories without sense amplifiers , 2012, Elektrotech. Informationstechnik.

[93]  S. Stall,et al.  Improving Cache Performance by Exploiting Read-Write Disparity , 2014 .

[94]  Ad J. van de Goor,et al.  Address and data scrambling: causes and impact on memory tests , 2002, Proceedings First IEEE International Workshop on Electronic Design, Test and Applications '2002.

[95]  Rajeev Balasubramonian,et al.  MemZip: Exploring unconventional benefits from memory compression , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).

[96]  Sudhanva Gurumurthi,et al.  Feng Shui of supercomputer memory positional effects in DRAM and SRAM faults , 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[97]  Reetuparna Das,et al.  Design and Evaluation of Hierarchical Rings with Deflection Routing , 2014, 2014 IEEE 26th International Symposium on Computer Architecture and High Performance Computing.

[98]  Michel Dubois,et al.  Sequential Hardware Prefetching in Shared-Memory Multiprocessors , 1995, IEEE Trans. Parallel Distributed Syst..

[99]  Harish Patil,et al.  Pin: building customized program analysis tools with dynamic instrumentation , 2005, PLDI '05.

[100]  Zhen Fang,et al.  Leveraging Heterogeneity in DRAM Main Memories to Accelerate Critical Word Access , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[101]  J. W. Park,et al.  DRAM variable retention time , 1992, 1992 International Technical Digest on Electron Devices Meeting.

[102]  Dongwoo Kang,et al.  Amnesic cache management for non-volatile memory , 2015, 2015 31st Symposium on Mass Storage Systems and Technologies (MSST).

[103]  Shuai Li,et al.  High-Performance and Lightweight Transaction Support in Flash-Based SSDs , 2015, IEEE Transactions on Computers.

[104]  Harold S. Stone,et al.  A Logic-in-Memory Computer , 1970, IEEE Transactions on Computers.

[105]  Scott A. Mahlke,et al.  Composite Cores: Pushing Heterogeneity Into a Core , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[106]  Onur Mutlu,et al.  Prefetch-Aware DRAM Controllers , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[107]  Stijn Eyerman,et al.  System-Level Performance Metrics for Multiprogram Workloads , 2008, IEEE Micro.

[108]  Amandeep Singh,et al.  Software based in-system memory test for highly available systems , 2005, 2005 IEEE International Workshop on Memory Technology, Design, and Testing (MTDT'05).

[109]  Hubertus Franke,et al.  Memory Expansion Technology (MXT): Software support and performance , 2001, IBM J. Res. Dev..

[110]  Onur Mutlu,et al.  Runahead execution: an alternative to very large instruction windows for out-of-order processors , 2003, The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings..

[111]  Onur Mutlu,et al.  A Case for Effic ient Hardware/Soft ware Cooperative Management of Storage and Memory , 2013 .

[112]  Onur Mutlu,et al.  Memory scaling: A systems architecture perspective , 2013, 2013 5th IEEE International Memory Workshop.

[113]  Wongyu Shin,et al.  NUAT: A non-uniform access time memory controller , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).

[114]  Mahmut T. Kandemir,et al.  Exploiting Inter-Warp Heterogeneity to Improve GPGPU Performance , 2015, 2015 International Conference on Parallel Architecture and Compilation (PACT).

[115]  Osman S. Unsal,et al.  Neighbor-cell assisted error correction for MLC NAND flash memories , 2014, SIGMETRICS '14.

[116]  Onur Mutlu,et al.  Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[117]  Charles A. Hart CDRAM in a unified memory architecture , 1994, Proceedings of COMPCON '94.

[118]  J. Torrellas,et al.  VARIUS: A Model of Process Variation and Resulting Timing Errors for Microarchitects , 2008, IEEE Transactions on Semiconductor Manufacturing.

[119]  Onur Mutlu,et al.  Self-Optimizing Memory Controllers: A Reinforcement Learning Approach , 2008, 2008 International Symposium on Computer Architecture.

[120]  Onur Mutlu,et al.  RFVP: Rollback-Free Value Prediction with Safe-to-Approximate Loads , 2016, ACM Trans. Archit. Code Optim..

[121]  Jing Li,et al.  A case for small row buffers in non-volatile main memories , 2012, 2012 IEEE 30th International Conference on Computer Design (ICCD).

[122]  Song Liu,et al.  Hardware/software techniques for DRAM thermal management , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.

[123]  Onur Mutlu,et al.  Base-Delta-Immediate Compression: A Practical Data Compression Mechanism for On-Chip Caches , 2012 .

[124]  Onur Mutlu,et al.  Adaptive-latency DRAM: Optimizing DRAM timing for the common-case , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[125]  James E. Smith,et al.  Fair Queuing Memory Systems , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[126]  Rajiv Kapoor,et al.  Pinpointing Representative Portions of Large Intel® Itanium® Programs with Dynamic Instrumentation , 2004, 37th International Symposium on Microarchitecture (MICRO-37'04).

[127]  Tao Zhang,et al.  Half-DRAM: A high-bandwidth and low-power DRAM architecture from the rethinking of fine-grained activation , 2014, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA).

[128]  Onur Mutlu,et al.  Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems , 2007, USENIX Security Symposium.

[129]  Anoop Gupta,et al.  Operating system support for improving data locality on CC-NUMA compute servers , 1996, ASPLOS VII.

[130]  Chris Fallin,et al.  Memory power management via dynamic voltage/frequency scaling , 2011, ICAC '11.

[131]  Onur Mutlu,et al.  Accelerating read mapping with FastHASH , 2013, BMC Genomics.

[132]  Reetuparna Das,et al.  Application-to-core mapping policies to reduce memory system interference in multi-core systems , 2013, 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA).

[133]  Larry Rudolph,et al.  Accelerating multi-media processing by implementing memoing in multiplication and division units , 1998, ASPLOS VIII.

[134]  Onur Mutlu,et al.  PARBOR: An Efficient System-Level Technique to Detect Data-Dependent Failures in DRAM , 2016, 2016 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN).

[135]  Hideto Hidaka,et al.  The cache DRAM architecture: a DRAM with an on-chip cache memory , 1990, IEEE Micro.

[136]  Jinseok Lee,et al.  Bit line coupling scheme and electrical fuse circuit for reliable operation of high density DRAM , 2001, 2001 Symposium on VLSI Circuits. Digest of Technical Papers (IEEE Cat. No.01CH37185).

[137]  Jung Ho Ahn,et al.  NDA: Near-DRAM acceleration architecture leveraging commodity DRAM devices and standard memory modules , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[138]  Alair Pereira do Lago,et al.  Adaptive compressed caching: design and implementation , 2003, Proceedings. 15th Symposium on Computer Architecture and High Performance Computing.

[139]  Onur Mutlu,et al.  Techniques for bandwidth-efficient prefetching of linked data structures in hybrid prefetching systems , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[140]  David A. Wood,et al.  Interactions Between Compression and Prefetching in Chip Multiprocessors , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.

[141]  Chris Fallin,et al.  Next generation on-chip networks: what kind of congestion control do we need? , 2010, Hotnets-IX.

[142]  Roy T. Fielding,et al.  The Apache HTTP Server Project , 1997, IEEE Internet Comput..

[143]  Onur Mutlu,et al.  Coordinated control of multiple prefetchers in multi-core systems , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[144]  Onur Mutlu,et al.  An experimental study of data retention behavior in modern DRAM devices: implications for retention time profiling mechanisms , 2013, ISCA.

[145]  Onur Mutlu,et al.  Threshold voltage distribution in MLC NAND flash memory: Characterization, analysis, and modeling , 2013, 2013 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[146]  Onur Mutlu,et al.  Exploiting compressed block size as an indicator of future reuse , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[147]  Thomas F. Wenisch,et al.  Memory persistency , 2014, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA).

[148]  Onur Mutlu,et al.  Simultaneous Multi-Layer Access , 2016, ACM Trans. Archit. Code Optim..

[149]  Jongmoo Choi,et al.  ThyNVM: Enabling software-transparent crash consistency in persistent memory systems , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[150]  Onur Mutlu,et al.  Runahead Execution: An Effective Alternative to Large Instruction Windows , 2003, IEEE Micro.

[151]  O. Mutlu,et al.  Fairness via source throttling: a configurable and high-performance fairness substrate for multi-core memory systems , 2010, ASPLOS XV.

[152]  J. E. Thornton,et al.  Parallel operation in the control data 6600 , 1964, AFIPS '64 (Fall, part II).

[153]  S. Phadke,et al.  MLP aware heterogeneous memory system , 2011, 2011 Design, Automation & Test in Europe.

[154]  Huiyang Zhou,et al.  Enhancing memory-level parallelism via recovery-free value prediction , 2005, IEEE Transactions on Computers.

[155]  Onur Mutlu,et al.  AVATAR: A Variable-Retention-Time (VRT) Aware Refresh for DRAM Systems , 2015, 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.

[156]  Aamer Jaleel,et al.  Adaptive insertion policies for high performance caching , 2007, ISCA '07.

[157]  Andrew A. Chien,et al.  The future of microprocessors , 2011, Commun. ACM.

[158]  Mor Harchol-Balter,et al.  Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[159]  Onur Mutlu,et al.  The Dirty-Block Index , 2014, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA).

[160]  David W. Nellans,et al.  Micro-pages: increasing DRAM efficiency with locality-aware data placement , 2010, ASPLOS XV.

[161]  Zhao Zhang,et al.  Mini-rank: Adaptive DRAM architecture for improving memory power efficiency , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[162]  Onur Mutlu,et al.  Phase change memory architecture and the quest for scalability , 2010, Commun. ACM.

[163]  Srinivasan Seshan,et al.  On-chip networks from a networking perspective: congestion and scalability in many-core interconnects , 2012, SIGCOMM '12.

[164]  Vijayalakshmi Srinivasan,et al.  Scalable high performance main memory system using phase-change memory technology , 2009, ISCA '09.

[165]  A. Ueno,et al.  A 16-Mbit DRAM with a relaxed sense-amplifier-pitch open-bit-line architecture , 1988 .

[166]  Anoop Gupta,et al.  Scheduling and page migration for multiprocessor compute servers , 1994, ASPLOS VI.

[167]  Thomas Vogelsang,et al.  Understanding the Energy Consumption of Dynamic Random Access Memories , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[168]  Onur Mutlu,et al.  Data retention in MLC NAND flash memory: Characterization, optimization, and recovery , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[169]  Jie Liu,et al.  Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory , 2014, 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.

[170]  Onur Mutlu,et al.  Enabling Efficient and Scalable Hybrid Memories Using Fine-Granularity DRAM Cache Management , 2012, IEEE Computer Architecture Letters.

[171]  André Seznec,et al.  Zero-content augmented caches , 2009, ICS '09.

[172]  Onur Mutlu,et al.  Kilo-NOC: A heterogeneous network-on-chip architecture for scalability and service guarantees , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[173]  Onur Mutlu,et al.  The evicted-address filter: A unified mechanism to address both cache pollution and thrashing , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).

[174]  Onur Mutlu,et al.  Linearly compressed pages: A low-complexity, low-latency main memory compression framework , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[175]  Meng-Fan Chang,et al.  Low Store Energy, Low VDDmin, 8T2R Nonvolatile Latch and SRAM With Vertical-Stacked Resistive Memory (Memristor) Devices for Low Power Mobile Applications , 2012, IEEE Journal of Solid-State Circuits.

[176]  Qiang Wu,et al.  Revisiting Memory Errors in Large-Scale Production Data Centers: Analysis and Modeling of New Trends from the Field , 2015, 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.

[177]  Chris Fallin,et al.  Flipping bits in memory without accessing them: An experimental study of DRAM disturbance errors , 2014, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA).

[178]  Jin Sun,et al.  Managing Hybrid Main Memories with a Page-Utility Driven Performance Model , 2015, ArXiv.

[179]  Chris Fallin,et al.  Parallel application memory scheduling , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[180]  Jung Ho Ahn,et al.  A Comprehensive Memory Modeling Tool and Its Application to the Design and Analysis of Future Memory Hierarchies , 2008, 2008 International Symposium on Computer Architecture.

[181]  Hongzhong Zheng,et al.  Co-Architecting Controllers and DRAM to Enhance DRAM Process Scaling , 2014 .

[182]  Onur Mutlu,et al.  Architecting phase change memory as a scalable dram alternative , 2009, ISCA '09.

[183]  Onur Mutlu,et al.  Accelerating critical section execution with asymmetric multi-core architectures , 2009, ASPLOS.

[184]  Chris Fallin,et al.  The heterogeneous block architecture , 2014, 2014 IEEE 32nd International Conference on Computer Design (ICCD).

[185]  Onur Mutlu,et al.  Research Problems and Opportunities in Memory Systems , 2014, Supercomput. Front. Innov..

[186]  Onur Mutlu,et al.  Gather-Scatter DRAM: In-DRAM address translation to improve the spatial locality of non-unit strided accesses , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[187]  Onur Mutlu,et al.  Address-Value Delta (AVD) Prediction: A Hardware Technique for Efficiently Parallelizing Dependent Cache Misses , 2006, IEEE Transactions on Computers.

[188]  William J. Dally,et al.  Memory access scheduling , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[189]  Onur Mutlu,et al.  Mitigating the Memory Bottleneck With Approximate Load Value Prediction , 2016, IEEE Design & Test.

[190]  Onur Mutlu,et al.  Bottleneck identification and scheduling in multithreaded applications , 2012, ASPLOS XVII.

[191]  Onur Mutlu,et al.  Accelerating Dependent Cache Misses with an Enhanced Memory Controller , 2016, ISCA.

[192]  Onur Mutlu,et al.  Ramulator: A Fast and Extensible DRAM Simulator , 2016, IEEE Computer Architecture Letters.

[193]  Richard C. Foss,et al.  High-speed, high-reliability circuit design for megabit DRAM , 1991 .

[194]  Sally A. McKee,et al.  Hitting the memory wall: implications of the obvious , 1995, CARN.

[195]  Onur Mutlu,et al.  Page overlays: An enhanced virtual memory framework to enable fine-grained memory management , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[196]  Brian Rogers,et al.  Scaling the bandwidth wall: challenges in and avenues for CMP scaling , 2009, ISCA '09.

[197]  Feng Lin,et al.  DRAM Circuit Design: Fundamental and High-Speed Topics , 2007 .

[198]  Onur Mutlu,et al.  Efficient Runahead Execution: Power-Efficient Memory Latency Tolerance , 2006, IEEE Micro.

[199]  J.Y. Lee,et al.  Simultaneously formed storage node contact and metal contact cell (SSMC) for 1 Gb DRAM and beyond , 1996, International Electron Devices Meeting. Technical Digest.

[200]  Christoforos E. Kozyrakis,et al.  Practical Near-Data Processing for In-Memory Analytics Frameworks , 2015, 2015 International Conference on Parallel Architecture and Compilation (PACT).

[201]  Hiroyuki Kobayashi,et al.  Fast cycle RAM (FCRAM); a 20-ns random row access, pipe-lined operating DRAM , 1998, 1998 Symposium on VLSI Circuits. Digest of Technical Papers (Cat. No.98CH36215).

[202]  D. Tavangarian,et al.  Automatic on-line memory tests in workstations , 1994, Proceedings of IEEE International Workshop on Memory Technology, Design, and Test.

[203]  Franziska Hoffmann,et al.  Design Of Analog Cmos Integrated Circuits , 2016 .

[204]  Jun Yang,et al.  Frequent Value Locality and Value-Centric Data Cache Design , 2000, ASPLOS.

[205]  Kevin Kai-Wei Chang,et al.  DASH: Deadline-Aware High-Performance Memory Scheduler for Heterogeneous Systems with Hardware Accelerators , 2016, ACM Trans. Archit. Code Optim..

[206]  Jung Ho Ahn,et al.  Multicore DIMM: an Energy Efficient Memory Module with Independently Controlled DRAMs , 2009, IEEE Computer Architecture Letters.

[207]  Onur Mutlu,et al.  Distributed order scheduling and its application to multi-core dram controllers , 2008, PODC '08.

[208]  Xiang Li,et al.  Thermal managerment of high power memory module for server platforms , 2008, 2008 11th Intersociety Conference on Thermal and Thermomechanical Phenomena in Electronic Systems.

[209]  Onur Mutlu,et al.  Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.

[210]  Onur Mutlu,et al.  FIRM: Fair and High-Performance Memory Control for Persistent Memory Systems , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.

[211]  M. Togo,et al.  64 Mb 6.8 ns random ROW access DRAM macro for ASICs , 1999, 1999 IEEE International Solid-State Circuits Conference. Digest of Technical Papers. ISSCC. First Edition (Cat. No.99CH36278).

[212]  André Seznec,et al.  Exploiting Single-Usage for Effective Memory Management , 2007, Asia-Pacific Computer Systems Architecture Conference.

[213]  Hiroki Koike,et al.  A 30-ns 64-Mb DRAM with built-in self-test and self-repair function , 1992 .

[214]  Onur Mutlu,et al.  Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).

[215]  Mor Harchol-Balter,et al.  ATLAS: A scalable and high-performance scheduling algorithm for multiple memory controllers , 2010, HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture.

[216]  Onur Mutlu,et al.  Tiered-latency DRAM: A low latency and low cost DRAM architecture , 2013, 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA).

[217]  Onur Mutlu,et al.  Utility-based acceleration of multithreaded applications on asymmetric CMPs , 2013, ISCA.

[218]  K. Gotoh,et al.  Technique for controlling effective Vth in multi-Gbit DRAM sense amplifier , 1996, 1996 Symposium on VLSI Circuits. Digest of Technical Papers.

[219]  Onur Mutlu,et al.  Understanding Latency Variation in Modern DRAM Chips: Experimental Characterization, Analysis, and Optimization , 2016, SIGMETRICS.

[220]  Maurice V. Wilkes,et al.  The memory gap and the future of high performance memories , 2001, CARN.

[221]  S. Narasimha,et al.  22nm High-performance SOI technology featuring dual-embedded stressors, Epi-Plate High-K deep-trench embedded DRAM and self-aligned Via 15LM BEOL , 2012, 2012 International Electron Devices Meeting.

[222]  Mahmut T. Kandemir,et al.  Exploiting Core Criticality for Enhanced GPU Performance , 2016, SIGMETRICS.

[223]  David J. DeWitt,et al.  DBMSs on a Modern Processor: Where Does Time Go? , 1999, VLDB.

[224]  Jesús Corbal,et al.  Dynamic Tolerance Region Computing for Multimedia , 2012, IEEE Trans. Computers.

[225]  K.J. Nesbit,et al.  AC/DC: an adaptive data cache prefetcher , 2004, Proceedings. 13th International Conference on Parallel Architecture and Compilation Techniques, 2004. PACT 2004..

[226]  Onur Mutlu,et al.  Techniques for efficient processing in runahead execution engines , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[227]  Kevin Kai-Wei Chang,et al.  HAT: Heterogeneous Adaptive Throttling for On-Chip Networks , 2012, 2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing.

[228]  H.-S. Philip Wong,et al.  Phase Change Memory , 2010, Proceedings of the IEEE.

[229]  R. Symanczyk,et al.  Conductive bridging RAM (CBRAM): an emerging non-volatile memory technology scalable to sub 20nm , 2005, IEEE InternationalElectron Devices Meeting, 2005. IEDM Technical Digest..

[230]  Mithuna Thottethodi,et al.  Self-tuned congestion control for multiprocessor networks , 2001, Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture.

[231]  Bianca Schroeder,et al.  A Large-Scale Study of Failures in High-Performance Computing Systems , 2010, IEEE Trans. Dependable Secur. Comput..

[232]  Qiang Wu,et al.  A Large-Scale Study of Flash Memory Failures in the Field , 2015, SIGMETRICS 2015.

[233]  David A. Patterson,et al.  Latency lags bandwith , 2004, CACM.

[234]  Dirk Grunwald,et al.  A stateless, content-directed data prefetching mechanism , 2002, ASPLOS X.

[235]  Onur Mutlu,et al.  Data marshaling for multi-core architectures , 2010, ISCA.

[236]  Onur Mutlu,et al.  Fast Bulk Bitwise AND and OR in DRAM , 2015, IEEE Computer Architecture Letters.

[237]  Dean M. Tullsen,et al.  Hardware identification of cache conflict misses , 1999, MICRO-32. Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture.

[238]  B J Smith,et al.  A pipelined, shared resource MIMD computer , 1986 .

[239]  Doris Schmitt-Landsiedel,et al.  DRAM Yield Analysis and Optimization by a Statistical Design Approach , 2011, IEEE Transactions on Circuits and Systems I: Regular Papers.

[240]  Onur Mutlu,et al.  Preemptive Virtual Clock: A flexible, efficient, and cost-effective QOS scheme for networks-on-chip , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[241]  Onur Mutlu,et al.  Express Cube Topologies for on-Chip Interconnects , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[242]  Vilas Sridharan,et al.  A study of DRAM failures in the field , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[243]  Andrew F. Glew MLP yes! ILP no , 1998, ASPLOS 1998.

[244]  Bianca Schroeder,et al.  Temperature management in data centers: why some (might) like it hot , 2012, SIGMETRICS '12.

[245]  Kevin Zhang,et al.  2nd generation embedded DRAM with 4X lower self refresh power in 22nm Tri-Gate CMOS technology , 2014, 2014 Symposium on VLSI Circuits Digest of Technical Papers.

[246]  Rei-Fu Huang,et al.  Fault models for embedded-DRAM macros , 2009, 2009 46th ACM/IEEE Design Automation Conference.

[247]  Norbert Wehn,et al.  Exploiting expendable process-margins in DRAMs for run-time performance optimization , 2014, 2014 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[248]  Onur Mutlu,et al.  Mitigating Prefetcher-Caused Pollution Using Informed Caching Policies for Prefetched Blocks , 2014, ACM Trans. Archit. Code Optim..

[249]  Jim Zelenka,et al.  Informed prefetching and caching , 1995, SOSP.

[250]  Jose-Maria Arnau,et al.  Eliminating redundant fragment shader executions on a mobile GPU via hardware memoization , 2014, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA).

[251]  Reetuparna Das,et al.  A case for hierarchical rings with deflection routing: An energy-efficient on-chip communication substrate , 2016, Parallel Comput..

[252]  Jongmoo Choi,et al.  Decoupled Direct Memory Access: Isolating CPU and IO Traffic by Leveraging a Dual-Data-Port DRAM , 2015, 2015 International Conference on Parallel Architecture and Compilation (PACT).

[253]  C. Alkan,et al.  Fast and accurate mapping of Complete Genomics reads. , 2015, Methods.

[254]  Onur Mutlu,et al.  Read Disturb Errors in MLC NAND Flash Memory: Characterization, Mitigation, and Recovery , 2015, 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.

[255]  Onur Mutlu,et al.  Efficient Data Mapping and Buffering Techniques for Multilevel Cell Phase-Change Memories , 2014, ACM Trans. Archit. Code Optim..

[256]  David A. Wood,et al.  Frequent Pattern Compression: A Significance-Based Compression Scheme for L2 Caches , 2004 .

[257]  S. Narasimha,et al.  High Performance 45-nm SOI Technology with Enhanced Strain, Porous Low-k BEOL, and Immersion Lithography , 2006, 2006 International Electron Devices Meeting.

[258]  Karthik Ramani,et al.  Microarchitectural wire management for performance and power in partitioned architectures , 2005, 11th International Symposium on High-Performance Computer Architecture.

[259]  Mike Ignatowski,et al.  TOP-PIM: throughput-oriented programmable processing in memory , 2014, HPDC '14.

[260]  Christoforos E. Kozyrakis,et al.  Improving System Energy Efficiency with Memory Rank Subsetting , 2012, TACO.