TCP Adaptation for MPI on Long-and-Fat Networks

Typical MPI applications alternate between computation and communication phases, and exchange messages in relatively small chunks. This traffic pattern suits TCP poorly, because TCP is designed to handle only a continuous flow of messages efficiently. The mismatch is well known, but fixes have not been integrated into today's TCP implementations, even though it seriously degrades performance, especially for MPI applications. This paper proposes three improvements to the Linux TCP stack: pacing at start-up, reducing the retransmission timeout (RTO), and switching TCP parameters at the transitions between computation phases in an MPI application. Evaluation of these improvements using the NAS Parallel Benchmarks shows that the BT, CG, IS, and SP benchmarks improved by 10 to 30 percent. In contrast, the FT and MG benchmarks showed no improvement because they exhibit the steady communication that TCP assumes, and the LU benchmark became slightly worse because it has very little communication.
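To make the idea of pacing at start-up concrete, the following is a minimal sketch of the general technique, not the paper's kernel implementation. It assumes that pacing means spreading one congestion window of packets evenly across a round-trip time instead of sending them back to back. The function names and the example window and RTT values are hypothetical and chosen only for illustration.

```python
def burst_send_times(cwnd_packets, start_s=0.0):
    """Unpaced baseline: every packet in the window leaves at the same instant."""
    if cwnd_packets <= 0:
        raise ValueError("cwnd_packets must be positive")
    return [start_s] * cwnd_packets


def paced_send_times(cwnd_packets, rtt_s, start_s=0.0):
    """Paced start: spread the window evenly over one round-trip time.

    The inter-packet gap is rtt_s / cwnd_packets (an assumed pacing rule,
    not taken from the paper).
    """
    if cwnd_packets <= 0 or rtt_s <= 0:
        raise ValueError("cwnd_packets and rtt_s must be positive")
    gap_s = rtt_s / cwnd_packets
    return [start_s + i * gap_s for i in range(cwnd_packets)]


if __name__ == "__main__":
    # Hypothetical values: a 4-packet window on a 20 ms round trip.
    print("burst:", burst_send_times(4))
    print("paced:", paced_send_times(4, 0.020))
```

With pacing, the sender offers the same amount of data per round trip, but without the back-to-back burst that can overflow a bottleneck queue on a long-and-fat network when a connection resumes sending after a computation phase.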
