NetDIMM: Low-Latency Near-Memory Network Interface Architecture

For several decades, the design of scale-out networks focused mainly on optimizing bandwidth, and this trend has served traditional Internet applications well. However, the emergence of datacenters as single computer entities has made latency as important as bandwidth in datacenter network design. The PCIe interconnect is known to be a latency bottleneck in communication networks, as its overhead can account for up to ~90% of the overall communication latency. Despite its overheads, PCIe is the de facto interconnect standard in servers because it has been well established and maintained for more than two decades. In addition to the PCIe overhead, data movements in the network software stack consume thousands of processor cycles and make ultra-low-latency networking more challenging. To tackle both the PCIe and data movement overheads, we architect NetDIMM, a near-memory network interface card (NIC) capable of in-memory buffer cloning. NetDIMM places a NIC chip in the buffer device of a dual in-line memory module (DIMM) and leverages the asynchronous memory access capability of DDR5 to share the memory modules between the host processor and the near-memory NIC. Our evaluation shows that NetDIMM improves per-packet latency by 49.9% on average compared with a baseline network deploying PCIe NICs.
