Architectural Techniques to Enhance DRAM Scaling

For decades, main memory has enjoyed the continuous scaling of its physical substrate: DRAM (Dynamic Random Access Memory). But now, DRAM scaling has reached a threshold where DRAM cells cannot be made smaller without jeopardizing their robustness. This thesis identifies two specific challenges to DRAM scaling, and presents architectural techniques to overcome them. First, DRAM cells are becoming less reliable. As DRAM process technology scales down to smaller dimensions, it is more likely for DRAM cells to electrically interfere with each other’s operation. We confirm this by exposing the vulnerability of the latest DRAM chips to a reliability problem called disturbance errors. By reading repeatedly from the same cell in DRAM, we show that it is possible to corrupt the data stored in nearby cells. We demonstrate this phenomenon on Intel and AMD systems using a malicious program that generates many DRAM accesses. We provide an extensive characterization of the errors, as well as their behavior, using a custom-built testing platform. After examining various potential ways of addressing the problem, we propose a low-overhead solution that effectively prevents the errors through a collaborative effort between the DRAM chips and the DRAM controller. Second, DRAM cells are becoming slower due to worsening variation in DRAM process technology. To alleviate the latency bottleneck, we propose to unlock fine-grained parallelism within a DRAM chip so that many accesses can be served at the same time. We take a close look at how a DRAM chip is internally organized, and find that it is divided
