Paper Information - HL-PCM: MLC PCM Main Memory with Accelerated Read

HL-PCM: MLC PCM Main Memory with Accelerated Read

Multi-Level Cell Phase Change Memory (MLC PCM) is a promising candidate technology for DRAM replacement in the main memory of modern computers. Despite its high density and low power advantages, this technology suffers seriously from slow read and write operations. While prior works have extensively studied the problem of slow writes, this paper targets the high read latency problem in MLC PCM and introduces an architectural mechanism to overcome it. To this end, we rely on the fact that reading different bits from an MLC cell takes different latencies, i.e., for a 2-bit MLC, reading its Most-Significant Bit (MSB) is fast, while reading its Least-Significant Bit (LSB) is slower. We then propose Half-Line PCM (HL-PCM), a novel memory architecture that leverages this non-uniformity in reading MLC PCM's content to send the two halves of a requested memory block to the processor in different cycles: one half of the block is sent ahead of the other half. If the processor requested a word belonging to the first half, it can resume execution on receiving the first half, while the other half has not been sent yet and is scheduled to be received by the memory controller later. HL-PCM is simple to implement, i.e., it needs only minor modifications to the memory controller, the search/evict policies of the last-level cache, and the data layout in main memory. Our experimental results show that the proposed design improves the average memory access latency by 33–43 percent and program execution time by 23 percent, on average, while incurring negligible overhead at the memory controller and PCM DIMM, in a 16-core chip multiprocessor (CMP) running memory-intensive benchmarks.
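The half-line idea in the abstract can be illustrated with a toy latency model. This is only a sketch under stated assumptions, not the authors' design or simulator: the latency values are arbitrary units chosen for illustration, the names `HalfLineTiming`, `critical_word_latency`, and `average_latency` are hypothetical, and the model assumes that the first half of a block is served entirely from fast-to-read bits while a conventional design waits for the slow read before returning any word.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HalfLineTiming:
    """Illustrative latencies in arbitrary units (assumed, not from the paper)."""
    fast_half_latency: int = 1  # half served from fast-to-read bits (assumed: MSBs)
    slow_half_latency: int = 2  # half served from slow-to-read bits (assumed: LSBs)


def critical_word_latency(word_index: int, words_per_line: int,
                          timing: HalfLineTiming, half_line: bool) -> int:
    """Latency until the requested word reaches the processor."""
    if not 0 <= word_index < words_per_line:
        raise ValueError("word_index out of range")
    if not half_line:
        # Assumed baseline: the whole block arrives only after the slow read.
        return timing.slow_half_latency
    # Half-line delivery: the first half arrives early, the second half later.
    in_first_half = word_index < words_per_line // 2
    return timing.fast_half_latency if in_first_half else timing.slow_half_latency


def average_latency(words_per_line: int, timing: HalfLineTiming,
                    half_line: bool) -> float:
    """Mean latency when every word is equally likely to be the requested one."""
    total = sum(critical_word_latency(i, words_per_line, timing, half_line)
                for i in range(words_per_line))
    return total / words_per_line


if __name__ == "__main__":
    t = HalfLineTiming()
    print("baseline :", average_latency(8, t, half_line=False))
    print("half-line:", average_latency(8, t, half_line=True))
```

In this toy model the benefit depends entirely on how often the requested word falls in the early half, which is consistent with the abstract's mention of changes to the data layout in main memory. The numbers it prints say nothing about the 33–43 percent improvement reported in the paper.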

Amin Jadidi | Mohammad Arjomand | Mahmut T. Kandemir | Chita R. Das | Anand Sivasubramaniam
