Adaptive Scalable RPC Timeout Mechanism for Large Scale Clusters

Timeouts, typically reported on a per-call basis, are commonly used for failure detection in RPC (remote procedure call) based systems. Stress testing on a very large cluster has shown that the traditional fixed timeout mechanism causes many unnecessary timeouts, especially when the server is heavily loaded. This paper proposes an Adaptive Scalable RPC Timeout (AST) mechanism that takes network conditions, server load, scalability, and performance into account. Under this mechanism, the timeout value set by the client is adjusted dynamically according to congestion in the network and on the server. In addition, the server can notify the client to modify the timeout value of an RPC. A series of simulations shows that AST is a more suitable failure detection mechanism for RPC models with timeouts, and that it improves system responsiveness, reliability, and stability without a negative impact on performance, even on large-scale cluster systems.
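
The abstract does not give the AST algorithm itself, so the following is only a minimal sketch of the general idea: a client-side timeout that tracks observed call latency and can be raised by a hint from the server. The smoothing rule is borrowed from TCP-style retransmission timers (a smoothed latency plus a multiple of its variation), which is a substitute technique and not the paper's method. The class name `AdaptiveTimeout`, its parameters, and the `server_hint` method are all hypothetical.

```python
class AdaptiveTimeout:
    """Hypothetical client-side timeout estimator (NOT the paper's AST algorithm).

    Assumption: latency is smoothed with a TCP-style estimator
    (smoothed latency + k * latency variation).
    """

    def __init__(self, initial=1.0, floor=0.1, ceiling=60.0,
                 alpha=0.125, beta=0.25, k=4.0):
        self.floor = floor        # smallest timeout ever used, in seconds
        self.ceiling = ceiling    # largest timeout ever used, in seconds
        self.alpha = alpha        # gain for the smoothed latency
        self.beta = beta          # gain for the latency variation
        self.k = k                # safety multiplier on the variation
        self.srtt = None          # smoothed latency, unset until first sample
        self.rttvar = None        # smoothed latency variation
        self.timeout = self._clamp(initial)

    def _clamp(self, value):
        return max(self.floor, min(self.ceiling, value))

    def observe(self, rtt):
        """Update the timeout from the measured latency of a completed call."""
        if self.srtt is None:
            self.srtt = rtt
            self.rttvar = rtt / 2.0
        else:
            self.rttvar = (1 - self.beta) * self.rttvar + self.beta * abs(self.srtt - rtt)
            self.srtt = (1 - self.alpha) * self.srtt + self.alpha * rtt
        self.timeout = self._clamp(self.srtt + self.k * self.rttvar)
        return self.timeout

    def server_hint(self, suggested):
        """Apply a timeout suggested by a loaded server (hypothetical message).

        The hint can only raise the current timeout, and it holds until the
        next call to observe() recomputes the value from measured latency.
        """
        self.timeout = self._clamp(max(self.timeout, suggested))
        return self.timeout


if __name__ == "__main__":
    t = AdaptiveTimeout()
    for latency in (0.20, 0.25, 0.22, 0.90):   # made-up latencies in seconds
        print("after", latency, "s call, timeout =", round(t.observe(latency), 3))
    print("after server hint, timeout =", t.server_hint(5.0))
```

In this sketch a slow server stretches the timeout in two ways: gradually, through the latency samples the client measures itself, and immediately, through an explicit hint. A fixed timeout offers neither path, which is the weakness the abstract describes.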
