论文信息 - Memory Systems and Interconnects for Scale-Out Servers

Memory Systems and Interconnects for Scale-Out Servers

The information revolution of the last decade has been fueled by the digitization of almost all human activities through a wide range of Internet services. The backbone of this information age are scale-out datacenters that need to collect, store, and process massive amounts of data. These datacenters distribute vast datasets across a large number of servers, typically into memory-resident shards so as to maintain strict quality-of-service guarantees. While data is driving the skyrocketing demands for scale-out servers, processor and memory manufacturers have reached fundamental efficiency limits, no longer able to increase server energy efficiency at a sufficient pace. As a result, energy has emerged as the main obstacle to the scalability of information technology (IT) with huge economic implications. Delivering sustainable IT calls for a paradigm shift in computer system design. As memory has taken a central role in IT infrastructure, memory-centric architectures are required to fully utilize the IT's costly memory investment. In response, processor architects are resorting to manycore architectures to leverage the abundant request-level parallelism found in data-centric applications. Manycore processors fully utilize available memory resources, thereby increasing IT efficiency by almost an order of magnitude. Because manycore server chips execute a large number of concurrent requests, they exhibit high incidence of accesses to the last-level-cache for fetching instructions (due to large instruction footprints), and off-chip memory (due to lack of temporal reuse in on-chip caches) for accessing dataset objects. As a result, on-chip interconnects and the memory system are emerging as major performance and energy-efficiency bottlenecks in servers. This thesis seeks to architect on-chip interconnects and memory systems that are tuned for the requirements of memory-centric scale-out servers. By studying a wide range of data-centric applications, we uncover application phenomena common in data-centric applications, and examine their implications on on-chip network and off-chip memory traffic. Finally, we propose specialized on-chip interconnects and memory systems that leverage common traffic characteristics, thereby improving server throughput and energy efficiency.

Stavros Volos | Stavros Volos

[1] Yoshiyasu Doi,et al. A 3 Watt 39.8–44.6 Gb/s Dual-Mode SFI5.2 SerDes Chip Set in 65 nm CMOS , 2010, IEEE Journal of Solid-State Circuits.

[2] Krisztián Flautner,et al. Evolution of thread-level parallelism in desktop applications , 2010, ISCA.

[3] Qingyuan Deng,et al. MemScale: active low-power modes for main memory , 2011, ASPLOS XVI.

[4] Jung Ho Ahn,et al. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[5] Babak Falsafi,et al. Meet the walkers accelerating index traversals for in-memory databases , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[6] Youngmoon Choi,et al. The next-generation 64b SPARC core in a T4 SoC processor , 2012, 2012 IEEE International Solid-State Circuits Conference.

[7] Babak Falsafi,et al. Clearing the clouds: a study of emerging scale-out workloads on modern hardware , 2012, ASPLOS XVII.

[8] Thomas F. Wenisch,et al. Spatial Memory Streaming , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).

[9] Babak Falsafi,et al. SHIFT: Shared history instruction fetch for lean-core server processors , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[10] William J. Dally,et al. Memory access scheduling , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[11] B. Jacob,et al. Buffer-On-Board Memory System , 2012 .

[12] Onur Mutlu,et al. A case for bufferless routing in on-chip networks , 2009, ISCA '09.

[13] Doe Hyun Yoon,et al. The dynamic granularity memory system , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[14] Roland E. Wunderlich,et al. SMARTS: accelerating microarchitecture simulation via rigorous statistical sampling , 2003, 30th Annual International Symposium on Computer Architecture, 2003. Proceedings..

[15] Yan Solihin,et al. CHOP: Adaptive filter-based DRAM caching for CMP server platforms , 2010, HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture.

[16] Gabriel H. Loh,et al. Fundamental Latency Trade-off in Architecting DRAM Caches: Outperforming Impractical SRAM-Tags with a Simple and Practical Design , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[17] Wei-Fen Lin,et al. Reducing DRAM latencies with an integrated memory hierarchy design , 2001, Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture.

[18] Daniel A. Jiménez,et al. Reducing network-on-chip energy consumption through spatial locality speculation , 2011, Proceedings of the Fifth ACM/IEEE International Symposium.

[19] Mikko H. Lipasti,et al. Stealth prefetching , 2006, ASPLOS XII.

[20] Thomas F. Wenisch,et al. SimFlex: Statistical Sampling of Computer System Simulation , 2006, IEEE Micro.

[21] Thomas F. Wenisch,et al. Temporal instruction fetch streaming , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[22] Lizy Kurian John,et al. Minimalist open-page: A DRAM page-mode scheduling policy for the many-core era , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[23] Babak Falsafi,et al. A Case for Specialized Processors for Scale-Out Workloads , 2014, IEEE Micro.

[24] Zhe Wang,et al. Improving writeback efficiency with decoupled last-write prediction , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[25] Sanjeev Kumar,et al. Exploiting spatial locality in data caches using spatial footprints , 1998, ISCA.

[26] Gabriel H. Loh,et al. 3D-Stacked Memory Architectures for Multi-core Processors , 2008, 2008 International Symposium on Computer Architecture.

[27] Mounir Meghelli,et al. A 16-Gb/s Backplane Transceiver With 12-Tap Current Integrating DFE and Dynamic Adaptation of Voltage Offset and Timing Drifts in 45-nm SOI CMOS Technology , 2012, IEEE J. Solid State Circuits.

[28] Babak Falsafi,et al. Unison Cache: A Scalable and Effective Die-Stacked DRAM Cache , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.

[29] Giovanni De Micheli,et al. CCNoC: Specializing On-Chip Interconnects for Energy Efficiency in Cache-Coherent Servers , 2012, 2012 IEEE/ACM Sixth International Symposium on Networks-on-Chip.

[30] Lei Jiang,et al. Die Stacking (3D) Microarchitecture , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[31] Onur Mutlu,et al. Express Cube Topologies for on-Chip Interconnects , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[32] Babak Falsafi,et al. Toward Dark Silicon in Servers , 2011, IEEE Micro.

[33] Kiyoung Choi,et al. Dynamic power management of off-chip links for Hybrid Memory Cubes , 2014, 2014 51st ACM/EDAC/IEEE Design Automation Conference (DAC).

[34] Willy Zwaenepoel,et al. X-Stream: edge-centric graph processing using streaming partitions , 2013, SOSP.

[35] Babak Falsafi,et al. Reactive NUCA: near-optimal block placement and replication in distributed caches , 2009, ISCA '09.

[36] Bruce Jacob,et al. DRAMSim2: A Cycle Accurate Memory System Simulator , 2011, IEEE Computer Architecture Letters.

[37] Pervez M. Aziz,et al. A 1.0625-to-14.025Gb/s multimedia transceiver with full-rate source-series-terminated transmit driver and floating-tap decision-feedback equalizer in 40nm CMOS , 2011, 2011 IEEE International Solid-State Circuits Conference.

[38] Jung Ho Ahn,et al. Multicore DIMM: an Energy Efficient Memory Module with Independently Controlled DRAMs , 2009, IEEE Computer Architecture Letters.

[39] Vijayalakshmi Srinivasan,et al. Scalable high performance main memory system using phase-change memory technology , 2009, ISCA '09.

[40] Mark Horowitz,et al. Rethinking DRAM Power Modes for Energy Proportionality , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[41] Babak Falsafi,et al. BuMP: Bulk Memory Access Prediction and Streaming , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.

[42] Parag Agrawal,et al. The case for RAMClouds: scalable high-performance storage entirely in DRAM , 2010, OPSR.

[43] Mark Horowitz,et al. Scaling, Power and the Future of CMOS , 2007, 20th International Conference on VLSI Design held jointly with 6th International Conference on Embedded Systems (VLSID'07).

[44] Brian Rogers,et al. Scaling the bandwidth wall: challenges in and avenues for CMP scaling , 2009, ISCA '09.

[45] Elad Alon,et al. A wide common-mode fully-adaptive multi-standard 12.5Gb/s backplane transceiver in 28nm CMOS , 2012, 2012 Symposium on VLSI Circuits (VLSIC).

[46] Babak Falsafi,et al. Die-stacked DRAM caches for servers: hit ratio, latency, or bandwidth? have it all with footprint cache , 2013, ISCA.

[47] Keiichi Yamamoto,et al. An 8Gb/s Transceiver with 3×-Oversampling 2-Threshold Eye-Tracking CDR Circuit for -36.8dB-loss Backplane , 2008, 2008 IEEE International Solid-State Circuits Conference - Digest of Technical Papers.

[48] Thomas F. Wenisch,et al. Thin servers with smart pipes: designing SoC accelerators for memcached , 2013, ISCA.

[49] Andrew B. Kahng,et al. ORION 2.0: A fast and accurate NoC power and area model for early-stage design space exploration , 2009, 2009 Design, Automation & Test in Europe Conference & Exhibition.

[50] Thomas Toifl,et al. A 28Gb/s 4-tap FFE/15-tap DFE serial link transceiver in 32nm SOI CMOS technology , 2012, 2012 IEEE International Solid-State Circuits Conference.

[51] Amaresh Malipatil,et al. 2.1 28Gb/s 560mW multi-standard SerDes with single-stage analog front-end and 14-tap decision-feedback equalizer in 28nm CMOS , 2014, 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC).

[52] Stefanos Kaxiras,et al. Improving CC-NUMA performance using Instruction-based Prediction , 1999, Proceedings Fifth International Symposium on High-Performance Computer Architecture.

[53] Yoichi Koyanagi,et al. A 4-channel 10.3Gb/s transceiver with adaptive phase equalizer for 4-to-41dB loss PCB channel , 2011, 2011 IEEE International Solid-State Circuits Conference.

[54] John Wu,et al. A 78mW 11.8Gb/s serial link transceiver with adaptive RX equalization and baud-rate CDR in 32nm CMOS , 2010, 2010 IEEE International Solid-State Circuits Conference - (ISSCC).

[55] Hsien-Hsin S. Lee,et al. An optimized 3D-stacked memory architecture by exploiting excessive, high-density TSV bandwidth , 2010, HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture.

[56] Babak Falsafi,et al. Optimizing Data-Center TCO with Scale-Out Processors , 2012, IEEE Micro.

[57] Avinoam Kolodny,et al. Best of both worlds: A bus enhanced NoC (BENoC) , 2010 .

[58] Kai Shen,et al. Virtual Machine Memory Access Tracing with Hypervisor Exclusive Cache , 2007, USENIX Annual Technical Conference.

[59] Luca P. Carloni,et al. Virtual channels vs. multiple physical networks: A comparative analysis , 2010, Design Automation Conference.

[60] J. Jeddeloh,et al. Hybrid memory cube new DRAM architecture increases density and performance , 2012, 2012 Symposium on VLSI Technology (VLSIT).

[61] Martin L. Kersten,et al. The researcher's guide to the data deluge , 2011, Proc. VLDB Endow..

[62] Jichuan Chang,et al. BOOM: Enabling mobile memory based low-power server DIMMs , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[63] G. Tyson,et al. Eager writeback-a technique for improving bandwidth utilization , 2000, Proceedings 33rd Annual IEEE/ACM International Symposium on Microarchitecture. MICRO-33 2000.

[64] James E. Smith,et al. Fair Queuing Memory Systems , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[65] Christoforos E. Kozyrakis,et al. Towards energy-proportional datacenter memory with mobile DRAM , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[66] Hugh Mair,et al. Analog-DFE-based 16Gb/s SerDes in 40nm CMOS that operates across 34dB loss channels at Nyquist with a baud rate CDR and 1.2Vpp voltage-mode driver , 2011, 2011 IEEE International Solid-State Circuits Conference.

[67] Sandhya Dwarkadas,et al. Amoeba-Cache: Adaptive Blocks for Eliminating Waste in the Memory Hierarchy , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[68] Thomas Krause,et al. A 12.5Gb/s SerDes in 65nm CMOS Using a Baud-Rate ADC with Digital Receiver Equalization and Clock Recovery , 2007, 2007 IEEE International Solid-State Circuits Conference. Digest of Technical Papers.

[69] William J. Dally,et al. Design tradeoffs for tiled CMP on-chip networks , 2006, ICS '06.

[70] Luiz André Barroso,et al. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines , 2009, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines.

[71] Trevor Mudge,et al. Scaling High-Performance Interconnect Architectures to Many-Core Systems , 2012 .

[72] M. Horowitz,et al. A 14-mW 6.25-Gb/s Transceiver in 90-nm CMOS , 2007, IEEE Journal of Solid-State Circuits.

[73] Babak Falsafi,et al. Scale-out NUMA , 2014, ASPLOS.

[74] Lizy Kurian John,et al. The virtual write queue: coordinating DRAM and last-level cache policies , 2010, ISCA.

[75] Li-Shiuan Peh,et al. Flow control and micro-architectural mechanisms for extending the performance of interconnection networks , 2001 .

[76] Ayal Shoval,et al. An 8.4mW/Gb/s 4-lane 48Gb/s multi-standard-compliant transceiver in 40nm digital CMOS technology , 2011, 2011 IEEE International Solid-State Circuits Conference.

[77] Vishnu Balan,et al. 26.1 A 130mW 20Gb/s half-duplex serial link in 28nm CMOS , 2014, 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC).

[78] George Michelogiannakis,et al. Elastic-buffer flow control for on-chip networks , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[79] Babak Falsafi,et al. Power Scaling: the Ultimate Obstacle to 1K-Core Chips , 2010 .

[80] Babak Falsafi,et al. Database Servers on Chip Multiprocessors: Limitations and Opportunities , 2007, CIDR.

[81] Babak Falsafi,et al. Scale-out processors , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[82] Michael Ferdman,et al. Architectural Support for Dynamic Linking , 2015, ASPLOS.

[83] Krishna P. Gummadi,et al. A measurement-driven analysis of information propagation in the flickr social network , 2009, WWW '09.

[84] Song Jiang,et al. Workload analysis of a large-scale key-value store , 2012, SIGMETRICS '12.

[85] Mark D. Hill,et al. Efficiently enabling conventional block sizes for very large die-stacked DRAM caches , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[86] Kathryn S. McKinley,et al. Guided region prefetching: a cooperative hardware/software approach , 2003, ISCA '03.

[87] Eisse Mensink,et al. Low-Power, High-Speed Transceivers for Network-on-Chip Communication , 2009, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

[88] R. Brama,et al. A multi standard 1.5 to 10Gb/s latch-based 3-tap DFE receiver with a SSC tolerant CDR for serial backplane communication , 2008, 2008 IEEE Symposium on VLSI Circuits.

[89] Mor Harchol-Balter,et al. Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[90] David W. Nellans,et al. Micro-pages: increasing DRAM efficiency with locality-aware data placement , 2010, ASPLOS XV.

[91] Sriram Sankar,et al. Server Engineering Insights for Large-Scale Online Services , 2010, IEEE Micro.

[92] Thomas F. Wenisch,et al. PowerNap: eliminating server idle power , 2009, ASPLOS.

[93] Sharad Malik,et al. Power-driven design of router microarchitectures in on-chip networks , 2003, Proceedings. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. MICRO-36..

[94] Jiangchuan Liu,et al. Statistics and Social Network of YouTube Videos , 2008, 2008 16th Interntional Workshop on Quality of Service.

[95] Herschel A. Ainspan,et al. A 78mW 11.1Gb/s 5-tap DFE receiver with digitally calibrated current-integrating summers in 65nm CMOS , 2009, 2009 IEEE International Solid-State Circuits Conference - Digest of Technical Papers.

[96] Hong Liu,et al. Energy proportional datacenter networks , 2010, ISCA.

[97] Gang Lu,et al. Characterization of real workloads of web search engines , 2011, 2011 IEEE International Symposium on Workload Characterization (IISWC).

[98] Norman P. Jouppi,et al. Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0 , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[99] Jung Ho Ahn,et al. CACTI-P: Architecture-level modeling for SRAM-based structures with advanced leakage reduction techniques , 2011, 2011 IEEE/ACM International Conference on Computer-Aided Design (ICCAD).

[100] Onur Mutlu,et al. Kilo-NOC: A heterogeneous network-on-chip architecture for scalability and service guarantees , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[101] Feifei Li,et al. NDC: Analyzing the impact of 3D-stacked memory+logic devices on MapReduce workloads , 2014, 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[102] Babak Falsafi,et al. TurboTag: Lookup filtering to reduce coherence directory power , 2010, 2010 ACM/IEEE International Symposium on Low-Power Electronics and Design (ISLPED).

[103] Henry Hoffmann,et al. The Raw Microprocessor: A Computational Fabric for Software Circuits and General-Purpose Programs , 2002, IEEE Micro.

[104] Joseph Kennedy,et al. 26.2 A 205mW 32Gb/s 3-Tap FFE/6-tap DFE bidirectional serial link in 22nm CMOS , 2014, 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC).

[105] Dam Sunwoo,et al. Balancing DRAM locality and parallelism in shared memory CMP systems , 2012, IEEE International Symposium on High-Performance Comp Architecture.

[106] Benjamin C. Lee,et al. Disintegrated control for energy-efficient and heterogeneous memory systems , 2013, 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA).

[107] Babak Falsafi,et al. Accurate and complexity-effective spatial pattern prediction , 2004, 10th International Symposium on High Performance Computer Architecture (HPCA'04).

[108] Chris Fallin,et al. Memory power management via dynamic voltage/frequency scaling , 2011, ICAC '11.

[109] Hui Ding,et al. TAO: Facebook's Distributed Data Store for the Social Graph , 2013, USENIX Annual Technical Conference.