Towards 100 Gbit/s Ethernet: Multicore-Based Parallel Communication Protocol Design

Ethernet line rates are projected to reach 100 Gbit/s as soon as 2010. While Ethernet is in principle suitable for high-performance clustered and parallel applications, it requires matching improvements in the system software stack. In this paper we address several sources of CPU and memory system overhead in the I/O path at line rates reaching 80 Gbit/s (bidirectional), using multiple 10 Gbit/s links per system node. The key contribution of our work is the design of a parallel high-performance communication protocol that uses context-independent page remapping to (a) reduce packet processing overheads; (b) reduce thread management and synchronization overheads; and (c) address affinity issues in NUMA multicore CPUs. Our design achieves the full 40 Gbit/s of available one-way Ethernet bandwidth and 57.6 Gbit/s (72%) of the 80 Gbit/s maximum bidirectional throughput (limited only by the memory system), while leaving ample CPU cycles for application processing.
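
The bandwidth figures quoted in the abstract can be checked with a short calculation. The sketch below is illustrative only: the link count of four is an assumption inferred from the 40 Gbit/s one-way figure and the 10 Gbit/s per-link rate, and the function names are hypothetical, not taken from the paper.

```python
LINK_RATE_GBITS = 10.0  # per-link line rate stated in the abstract


def aggregate_bandwidth(num_links, link_rate=LINK_RATE_GBITS):
    """Return (one-way, bidirectional) aggregate bandwidth in Gbit/s."""
    one_way = num_links * link_rate
    return one_way, 2 * one_way


def utilization(achieved, maximum):
    """Fraction of the maximum throughput that was achieved."""
    return achieved / maximum


# Assumption: four links per node, inferred from 40 Gbit/s one-way.
one_way, bidirectional = aggregate_bandwidth(4)
print(f"one-way: {one_way} Gbit/s, bidirectional: {bidirectional} Gbit/s")
print(f"bidirectional utilization: {utilization(57.6, bidirectional):.0%}")
```

Under these assumptions the reported 57.6 Gbit/s corresponds to 72% of the 80 Gbit/s bidirectional maximum, matching the figure given in the abstract.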
