Marlin: A memory-based rack area network

Disaggregation of hardware resources that are traditionally embedded within individual servers into separate resource pools is an emerging architectural trend in hyperscale data center design, as exemplified by Facebook's disaggregated rack architecture. This paper presents the design, implementation and evaluation of a PCIe-based rack area network system called Marlin, which is designed to support the communications and resource sharing needs of disaggregated racks. By virtue of being based on PCIe, Marlin presents a memory-based addressing model for both I/O device sharing among multiple hosts and inter-host communications. That is, when a node communicates with other nodes or accesses resources in the same rack, it uses memory read and write operations. In the area of inter-node communications, Marlin offers hardware-based remote direct memory access (HRDMA) as a first-class communications primitive between servers within a rack. In addition, Marlin supports socket-based communications for legacy network applications and cross-machine zero memory copying for applications designed specifically to take full advantage of memory-based communications. Empirical measurements on a fully operational Marlin prototype based on 4-lane Gen3 PCIe technology show that the one-way kernel-to-kernel latency is 8.5 μs and the end-to-end sustainable TCP throughput is 19.6 Gbps.
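To put the reported throughput in context, the following minimal sketch compares 19.6 Gbps against the per-direction bandwidth of a 4-lane Gen3 PCIe link. The 8 GT/s signaling rate and 128b/130b encoding come from the PCIe 3.0 specification rather than from the abstract, the function names are illustrative, and the calculation ignores packet-level (TLP/DLLP) overhead, so the resulting ratio is only a rough lower bound on link efficiency.

```python
# Assumed PCIe Gen3 link parameters (from the PCIe 3.0 spec, not the abstract).
GEN3_GT_PER_S = 8.0              # raw signaling rate per lane, in GT/s
ENCODING_EFFICIENCY = 128 / 130  # 128b/130b line encoding
LANES = 4                        # lane count stated in the abstract


def raw_link_gbps(lanes: int = LANES) -> float:
    """Per-direction bandwidth after line encoding, before packet overhead."""
    return GEN3_GT_PER_S * ENCODING_EFFICIENCY * lanes


def utilization(measured_gbps: float, lanes: int = LANES) -> float:
    """Fraction of the encoded link bandwidth that a measured rate represents."""
    return measured_gbps / raw_link_gbps(lanes)


if __name__ == "__main__":
    print(f"link bandwidth: {raw_link_gbps():.1f} Gbps")
    print(f"TCP throughput share: {utilization(19.6):.0%}")
```

Under these assumptions the link carries roughly 31.5 Gbps per direction, so the reported TCP throughput corresponds to a bit over 60% of the encoded link rate.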
