Load Balancing in Cluster Using BLCR Checkpoint/Restart

Modern computation is growing more complex, and its resource requirements are steadily increasing. High Throughput Computing is one technique for dealing with this complexity. After running for a significant amount of time, computing clusters become heavily overloaded, which degrades performance. Because Computer Supported Cooperative Working (CSCW) has no central coordinator, load balancing is harder to achieve. Overloaded nodes do not participate in a CSCW network. This paper proposes migrating compute-intensive jobs from overloaded nodes to underloaded nodes, which frees the overloaded nodes to participate in CSCW and improves performance by increasing the number of participating nodes. Evaluation of the proposed approach shows that the availability and performance of CSCW clusters improve by 30%-40% with fault-tolerance-based load balancing.
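The migration decision described above can be illustrated with a minimal sketch. It is not the paper's algorithm: the `Node` class, the `plan_migrations` function, the normalized load metric, and the `high`/`low` thresholds are all hypothetical choices made for illustration. The sketch only plans which jobs move where; the actual transfer would be carried out by checkpointing the job on the source node and restarting it on the target with BLCR, which is not shown.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    load: float  # normalized load in [0, 1]; a hypothetical metric
    jobs: list = field(default_factory=list)  # (job_id, cost) pairs


def plan_migrations(nodes, high=0.8, low=0.4):
    """Plan (job_id, source, target) moves from overloaded to underloaded nodes.

    A node is treated as overloaded above `high` and underloaded below `low`;
    both thresholds are assumptions, not values from the paper.
    """
    moves = []
    # Handle the most heavily loaded nodes first.
    overloaded = sorted((n for n in nodes if n.load > high), key=lambda n: -n.load)
    for src in overloaded:
        # Try the most compute-intensive jobs first.
        for job_id, cost in sorted(src.jobs, key=lambda j: -j[1]):
            if src.load <= high:
                break  # source is no longer overloaded
            # Targets must be underloaded and must not become overloaded.
            candidates = [n for n in nodes
                          if n.load < low and n.load + cost <= high]
            if not candidates:
                continue  # a smaller job may still fit somewhere
            dst = min(candidates, key=lambda n: n.load)
            src.jobs.remove((job_id, cost))
            dst.jobs.append((job_id, cost))
            src.load -= cost
            dst.load += cost
            moves.append((job_id, src.name, dst.name))
    return moves
```

Each planned move would then map to one checkpoint on the source node and one restart on the target node. Picking the least-loaded target is a simple greedy choice; other placement policies would fit the same structure.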
