Employing transport layer multi-railing in cluster networks

Assembling clusters from commodity off-the-shelf parts is a well-established way to build inexpensive medium- to large-size computing systems. Many commodity mid-range motherboards come with multiple Gigabit Ethernet interfaces, and the low cost per port of Gigabit Ethernet makes switches inexpensive as well. Our objective in this work is to take advantage of multiple inexpensive Gigabit network cards and Ethernet switches to improve both the communication performance and the reliability of a cluster. Unlike previous approaches to multi-railing over multiple network connections, we consider CMT (Concurrent Multipath Transfer), an extension to SCTP (Stream Control Transmission Protocol), a transport protocol developed by the IETF, that makes use of the multiple paths that exist between two hosts. In this work, we explore the applicability of CMT, which operates in the transport layer of the network stack, to high-performance computing environments. We develop SCTP-based MPI (Message Passing Interface) middleware for MPICH2 and Open MPI, and evaluate the reliability and communication performance of the system. Using Open MPI, which supports message striping over multiple paths at the middleware level, we compare multi-railing in the middleware with multi-railing at the transport layer.
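
To make the contrast concrete, the following is a minimal sketch of what middleware-level message striping involves, not the Open MPI implementation. It splits a message into fixed-size chunks, assigns them round-robin to a set of rails, and reassembles them by sequence number. The names `stripe` and `reassemble`, the round-robin policy, and the fixed chunk size are all assumptions made for illustration. With CMT, the analogous scheduling across paths happens inside a single SCTP association, so the middleware would not need this logic.

```python
from typing import Dict, List, Tuple

Chunk = Tuple[int, bytes]  # (sequence number, payload)


def stripe(message: bytes, num_rails: int, chunk_size: int) -> Dict[int, List[Chunk]]:
    """Assign fixed-size chunks of a message to rails in round-robin order.

    Hypothetical helper: the policy and parameters are illustrative only.
    """
    if num_rails < 1 or chunk_size < 1:
        raise ValueError("num_rails and chunk_size must be positive")
    rails: Dict[int, List[Chunk]] = {r: [] for r in range(num_rails)}
    for seq, offset in enumerate(range(0, len(message), chunk_size)):
        payload = message[offset:offset + chunk_size]
        rails[seq % num_rails].append((seq, payload))
    return rails


def reassemble(rails: Dict[int, List[Chunk]]) -> bytes:
    """Merge the chunks from all rails back into the original byte order."""
    chunks = [chunk for queue in rails.values() for chunk in queue]
    chunks.sort(key=lambda chunk: chunk[0])  # order by sequence number
    return b"".join(payload for _, payload in chunks)


if __name__ == "__main__":
    data = bytes(range(10))
    per_rail = stripe(data, num_rails=2, chunk_size=3)
    for rail, queue in per_rail.items():
        print(rail, [seq for seq, _ in queue])
    print(reassemble(per_rail) == data)
```

In this sketch the sequence numbers carry all the ordering information, so chunks may arrive from the rails in any order. A real middleware layer would also have to handle rail failure and uneven rail bandwidth, which this example deliberately leaves out.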
