Analyzing the impact of supporting out-of-order communication on in-order performance with iWARP

Due to the growing need to tolerate network faults and congestion in high-end computing systems, supporting multiple network communication paths is becoming increasingly important. However, multi-path communication has a drawback: because packets may traverse different paths, they can arrive out of order. While modern networking stacks such as the Internet Wide-Area RDMA Protocol (iWARP) over 10-Gigabit Ethernet (10GE) support multi-path communication, their current implementations do not handle out-of-order packets, primarily because of the overhead that such support adds to in-order communication. Specifically, in iWARP, supporting out-of-order packets requires every packet to carry additional information, which causes significant overhead even for packets that arrive in order. In this paper, we therefore analyze the trade-offs in designing a feature-complete iWARP stack, i.e., one that supports out-of-order packet arrival, and thus multi-path systems, while focusing on the performance of in-order communication. We propose three feature-complete iWARP designs and analyze the pros and cons of each using performance experiments based on several micro-benchmarks as well as an iso-surface visual rendering application. Our analysis reveals that the iWARP design providing the best overall performance depends on the particular characteristics of the upper layers, and that different designs are optimal depending on the metric of interest.
