A collaborative memory system for high-performance and cost-effective clustered architectures

With the rapid development of highly integrated distributed systems (cluster systems), especially those encapsulated within a single platform [28, 9], designers face memory hierarchy design choices aimed at avoiding swapping to disk, which slows application execution drastically. Leveraging free remote memory through memory collaboration has been shown to be cost-effective compared with overprovisioning memory for peak load. Recent studies propose several ways of accessing under-utilized remote memory in static system configurations, but do not explore dynamic memory collaboration in detail. Dynamic collaboration matters because memory usage in clustered systems fluctuates at run time. In this paper, we propose an Autonomous Collaborative Memory System (ACMS) that manages memory resources dynamically at run time to optimize performance and provide QoS measures for the nodes engaging in the system. We implement a prototype of ACMS, experiment with a wide range of real-world applications, and show up to 3x speedup over a non-collaborative memory system, with no perceivable performance impact on the nodes that provide memory. Based on our experiments, we analyze the remote memory access overhead in detail and provide insights for future optimizations.

[1] Andrew S. Tanenbaum, et al. Distributed operating systems, 2009, CSUR.

[2] Adrian Schüpbach, et al. The multikernel: a new OS architecture for scalable multicore systems, 2009, SOSP '09.

[3] Mitsuhisa Sato, et al. DLM: A distributed Large Memory System using remote memory swapping over cluster nodes, 2008, 2008 IEEE International Conference on Cluster Computing.

[4] Aamer Jaleel, et al. Adaptive insertion policies for high performance caching, 2007, ISCA '07.

[5] Charles R. Maule. iWARP ethernet: key to driving ethernet into high performance environments, 2006, SC.

[6] Jichuan Chang, et al. Cooperative Caching for Chip Multiprocessors, 2006, 33rd International Symposium on Computer Architecture (ISCA'06).

[7] Norman P. Jouppi, et al. High-performance ethernet-based communications for future multi-core processors, 2007, Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07).

[8] Krste Asanovic, et al. Victim replication: maximizing capacity while hiding wire delay in tiled chip multiprocessors, 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[9] Yingwei Luo, et al. A Transparent Remote Paging Model for Virtual Machines, 2008.

[10] Won-Taek Lim, et al. Architectural support for operating system-driven CMP cache management, 2006, 2006 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[11] Vijayalakshmi Srinivasan, et al. Scalable high performance main memory system using phase-change memory technology, 2009, ISCA '09.

[12] Dhabaleswar K. Panda, et al. Swapping to Remote Memory over InfiniBand: An Approach using a High Performance Network Block Device, 2005, 2005 IEEE International Conference on Cluster Computing.

[13] Parag Agrawal, et al. The case for RAMClouds: scalable high-performance storage entirely in DRAM, 2010, OPSR.

[14] Yan Solihin, et al. Evaluating placement policies for managing capacity sharing in CMP architectures with private caches, 2011, TACO.

[15] Thomas F. Wenisch, et al. Disaggregated memory for expansion and sharing in blade servers, 2009, ISCA '09.

[16] Adding Flexibility to a Remote Memory, 1995.

[17] Evangelos P. Markatos, et al. Adding flexibility to a remote memory pager, 1995, Proceedings of International Workshop on Object Orientation in Operating Systems.

[18] David E. Culler, et al. The ganglia distributed monitoring system: design, implementation, and experience, 2004, Parallel Comput.

[19] Ricardo Bianchini, et al. Page placement in hybrid memory systems, 2011, ICS '11.

[20] David A. Wood, et al. ASR: Adaptive Selective Replication for CMP Caches, 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[21] Michael Stumm, et al. RapidMRC: approximating L2 miss rate curves on commodity systems for online optimizations, 2009, ASPLOS.

[22] Y. Solihin, et al. Understanding the Limits of Capacity Sharing in CMP Private Caches, 2009.

[23] Zeshan Chishti, et al. Optimizing replication, communication, and capacity allocation in CMPs, 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[24] Rajeev Balasubramonian, et al. Dynamic hardware-assisted software-controlled page placement to manage capacity allocation and sharing within large caches, 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[25] Tajana Simunic, et al. PDRAM: A hybrid PRAM and DRAM main memory system, 2009, 2009 46th ACM/IEEE Design Automation Conference.

[26] Moinuddin K. Qureshi. Adaptive Spill-Receive for robust high-performance caching in CMPs, 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[27] Stephen Bailey, et al. An Overview of RDMA over IP, 2002.

[28] Evangelos P. Markatos, et al. Implementation of a Reliable Remote Memory Pager, 1996, USENIX ATC.

[29] Zhiyong Xu, et al. Collaborative Memory Pool in Cluster System, 2007, 2007 International Conference on Parallel Processing (ICPP 2007).

[30] Kuzman Ganchev, et al. Nswap: A Network Swapping Module for Linux Clusters, 2003, Euro-Par.