Architecting Memory Systems for Emerging Technologies

The advance of traditional dynamic random access memory (DRAM) technology has slowed, while the capacity and performance demands on memory systems have continued to increase. This is a result of the growing data volume of emerging applications such as machine learning and big data analytics. In addition to these demands, increasing energy consumption is becoming a major constraint on the capabilities of computer systems. As a result, emerging non-volatile memories, for example, Spin-Transfer Torque Magnetic RAM (STT-MRAM), and new memory interfaces, for example, High Bandwidth Memory (HBM), have been developed as alternatives. Thus far, most previous studies have retained a DRAM-like memory architecture and management policy. This preserves compatibility but hides the true benefits of these new memory technologies. In this research, we proposed the co-design of memory architectures and their management policies for emerging technologies. First, we introduced a new memory architecture for an STT-MRAM main memory. In particular, we defined a new page mode operation for efficient activation and sensing. By fully exploiting the non-destructive nature of STT-MRAM, our design achieved higher performance, lower energy consumption, and a smaller area than traditional designs. Second, we developed a cost-effective technique to improve load balancing for HBM memory channels. We showed that the proposed technique was capable of efficiently redistributing memory requests across multiple memory channels to improve channel utilization, resulting in improved performance.
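
To make the channel load-balancing problem concrete, the following minimal sketch shows how a strided request stream can pile onto a single memory channel under plain address interleaving, and how a different address-to-channel mapping can spread the same requests evenly. It is not the technique proposed in this work, which the abstract does not describe in detail. The XOR-folding hash is a generic, well-known mapping swapped in purely for illustration, and the channel count, request size, and all function names are assumptions.

```python
from collections import Counter

NUM_CHANNELS = 8   # assumed channel count (illustrative only)
LINE_BYTES = 64    # assumed request granularity in bytes (illustrative only)


def channel_modulo(addr: int) -> int:
    """Baseline mapping: channel index is the low bits of the line address."""
    return (addr // LINE_BYTES) % NUM_CHANNELS


def channel_xor(addr: int) -> int:
    """Generic XOR-folding hash: fold every base-NUM_CHANNELS digit of the
    line address into the channel index (an illustrative stand-in, not the
    authors' method)."""
    line = addr // LINE_BYTES
    idx = 0
    while line:
        idx ^= line % NUM_CHANNELS
        line //= NUM_CHANNELS
    return idx


def utilization(mapper, addrs):
    """Return (requests per channel, max/mean imbalance ratio)."""
    counts = Counter(mapper(a) for a in addrs)
    per_channel = [counts.get(c, 0) for c in range(NUM_CHANNELS)]
    mean = sum(per_channel) / NUM_CHANNELS
    return per_channel, max(per_channel) / mean


# Hypothetical strided stream: consecutive requests are NUM_CHANNELS lines apart.
addrs = [i * LINE_BYTES * NUM_CHANNELS for i in range(1024)]

if __name__ == "__main__":
    for name, mapper in (("modulo", channel_modulo), ("xor-fold", channel_xor)):
        per_channel, imbalance = utilization(mapper, addrs)
        print(f"{name:9s} per-channel={per_channel} imbalance={imbalance:.2f}")
```

With this particular stride, the baseline mapping sends every request to channel 0, while the XOR-folded mapping places the same number of requests on each channel. Real workloads are far less regular, so this toy stream illustrates only the kind of imbalance that a redistribution technique targets.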
