Integrated 3D-stacked server designs for increasing physical density of key-value stores

Key-value stores, such as Memcached, have been used to scale web services since the beginning of the Web 2.0 era. Data center real estate is expensive, and several industry experts we have spoken to have suggested that a significant portion of their data center space is devoted to key-value stores. Despite Memcached's widespread use, there is little hardware specialization for increasing its efficiency and density; it is currently deployed on commodity servers that contain high-end CPUs designed to extract as much instruction-level parallelism as possible. Out-of-order CPUs, however, have been shown to be inefficient when running Memcached. To address these efficiency issues, we propose two architectures that use 3D stacking to increase data storage efficiency. Our first 3D architecture, Mercury, consists of stacks of ARM Cortex-A7 cores with 4GB of DRAM, as well as NICs. Our second architecture, Iridium, replaces DRAM with NAND Flash to improve density. We explore, through simulation, the potential efficiency benefits of running Memcached on servers that use 3D stacking to closely integrate low-power CPUs with NICs and memory. With Mercury we demonstrate that density may be improved by 2.9X, power efficiency by 4.9X, throughput by 10X, and throughput per GB by 3.5X over a state-of-the-art server running optimized Memcached. With Iridium we show that density may be increased by 14X, power efficiency by 2.4X, and throughput by 5.2X, while still meeting latency requirements for a majority of requests.
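As a rough sanity check on how the reported ratios relate, the sketch below divides the throughput improvement by the density improvement. It assumes that both ratios are measured over the same server footprint, so that their quotient approximates the change in throughput per GB; that assumption, and the helper name `per_gb_ratio`, are illustrative and not taken from the paper.

```python
def per_gb_ratio(throughput_gain: float, density_gain: float) -> float:
    """Approximate change in throughput per GB.

    Assumption (not from the paper): throughput and density gains are
    both normalized to the same server footprint, so dividing one by
    the other gives the relative throughput per stored GB.
    """
    return throughput_gain / density_gain


# Ratios quoted in the abstract.
mercury = per_gb_ratio(throughput_gain=10.0, density_gain=2.9)
iridium = per_gb_ratio(throughput_gain=5.2, density_gain=14.0)

print(f"Mercury throughput per GB: {mercury:.2f}x")
print(f"Iridium throughput per GB: {iridium:.2f}x")
```

Under that assumption, the Mercury quotient lands close to the reported 3.5X; the small gap is plausibly due to rounding in the published figures. The Iridium quotient is below 1, which would be in keeping with Flash trading per-GB throughput for capacity, although the abstract does not report that figure directly.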
