A lightweight, high performance communication protocol for grid computing

This paper describes a lightweight, high-performance communication protocol for the high-bandwidth, high-delay networks typical of computational Grids. A distinguishing feature of the protocol is a highly accurate classification mechanism that is efficient enough to diagnose the cause of data loss in real time, giving the controller the opportunity to respond to different causes of data loss in different ways. The simplest adaptive response, and the one discussed in this paper, is to trigger aggressive congestion control measures only when the data loss is diagnosed as network related. Even this very simple adaptation can have a tremendous impact on performance in a Grid setting, where the resources allocated to a long-running, data-intensive application can fluctuate significantly during its execution. We provide results showing that using the information provided by the classifier increased performance by over two orders of magnitude, depending on the dominant cause of data loss. We discuss the Bayesian statistical framework on which the classifier is based and the classification metrics that make this approach highly successful. We also discuss the integration of the classifier into the congestion control structures of an existing high-performance communication protocol, and provide empirical results showing that it correctly diagnosed the cause of data loss in over 98% of the experimental trials.
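The adaptive response described above can be illustrated with a minimal sketch. It is not the paper's classifier: the single binary feature `signature_seen`, the likelihood values, the prior, the 0.5 decision threshold, and the halving back-off are all hypothetical placeholders chosen for illustration. The sketch only shows the general shape of the idea, which is to apply Bayes' rule to estimate the probability that a loss is network related and to reduce the sending rate only when that probability is high.

```python
from dataclasses import dataclass


@dataclass
class LossClassifier:
    """Two-hypothesis Bayesian classifier: network vs. non-network loss.

    All numeric values are illustrative assumptions, not values from the paper.
    """
    prior_network: float = 0.5          # assumed prior P(network)
    p_sig_given_network: float = 0.8    # assumed P(signature | network)
    p_sig_given_other: float = 0.2      # assumed P(signature | non-network)

    def posterior_network(self, signature_seen: bool) -> float:
        """Return P(network | observation) using Bayes' rule."""
        if signature_seen:
            like_net = self.p_sig_given_network
            like_other = self.p_sig_given_other
        else:
            like_net = 1.0 - self.p_sig_given_network
            like_other = 1.0 - self.p_sig_given_other
        evidence = (like_net * self.prior_network
                    + like_other * (1.0 - self.prior_network))
        return like_net * self.prior_network / evidence


def next_rate(rate: float, loss_seen: bool, signature_seen: bool,
              clf: LossClassifier, threshold: float = 0.5) -> float:
    """Back off only when the loss is diagnosed as network related."""
    if not loss_seen:
        return rate
    if clf.posterior_network(signature_seen) > threshold:
        return rate * 0.5   # hypothetical aggressive back-off
    return rate             # non-network loss: keep the sending rate


clf = LossClassifier()
rate_after_network_loss = next_rate(100.0, True, True, clf)
rate_after_other_loss = next_rate(100.0, True, False, clf)
```

With these placeholder values, a loss that shows the signature is treated as congestion and the rate is halved, while a loss without it leaves the rate unchanged. A real implementation would replace the single binary feature with the classification metrics the paper develops.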
