A Study on Non-volatile 3D Stacked Memory for Big Data Applications

Big data processing has become an increasingly important field of computer applications and has attracted considerable attention from academia and industry. However, it worsens the memory wall problem in processor design, that is, the large performance gap between processor computation and memory access. Stacked memory offers potential benefits for future processor designs, such as low latency, large capacity, and high bandwidth. Because these benefits can effectively relieve the memory wall problem, stacked memory is a promising architectural technique. Stacked memory structures have begun to use non-volatile memory (NVM) to provide faster and larger memory, but their memory access behaviour for big data applications has not been fully studied. To better understand this behaviour, this paper analyses the NVM 3D stacked structure by simulation. Since flash memory is the most mature NVM medium, this paper uses flash as the NVM part of the stacked structure, which results in a processor architecture with tightly connected CPU, DRAM and flash layers. In our experiments, the test variables are channel number, capacity, page size, and read and write latency. From the evaluation results of eight programs from a big data program set, we conclude that bandwidth and capacity have a significant effect on big data applications, and that as bandwidth and capacity increase, the read/write latency of flash and the page size have less effect. We also point out some problems concerning data consistency, channel selection, read and write strategy, and data granularity selection. These analysis results are useful for further study and optimization of the NVM 3D stacked structure.
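The experiment design above varies four kinds of parameters: channel number, capacity, page size, and read/write latency. As a minimal sketch of how such a sweep could be enumerated before handing each point to a simulator, the Python below builds the cross product of the variables. The class name `FlashConfig`, the function `enumerate_configs`, and every numeric value are hypothetical and chosen only for illustration; they are not the settings used in the paper.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class FlashConfig:
    """One point in the design space of the flash layer (hypothetical names)."""
    channels: int
    capacity_gib: int
    page_size_kib: int
    read_latency_us: float
    write_latency_us: float


# Illustrative values only; the paper's actual settings are not given here.
SWEEP = {
    "channels": [1, 2, 4, 8],
    "capacity_gib": [4, 8, 16],
    "page_size_kib": [4, 8, 16],
    # Read and write latency are varied together as (read, write) pairs.
    "latency_us": [(25.0, 200.0), (50.0, 400.0)],
}


def enumerate_configs(sweep=SWEEP):
    """Return every combination of the swept variables as a FlashConfig."""
    configs = []
    for channels, capacity, page, (read_us, write_us) in product(
        sweep["channels"],
        sweep["capacity_gib"],
        sweep["page_size_kib"],
        sweep["latency_us"],
    ):
        configs.append(FlashConfig(channels, capacity, page, read_us, write_us))
    return configs


if __name__ == "__main__":
    all_configs = enumerate_configs()
    print(len(all_configs), "configurations to simulate")
    print(all_configs[0])
```

Each configuration would then be run against every benchmark program, so the number of simulations is the number of configurations multiplied by the number of programs.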
