Design Techniques for the Scalability of Cluster Management Software on Dawning Supercomputers

Cluster management software faces growing scalability challenges as clusters continue to grow in size. Good scalability depends on sound design techniques, in particular hybrid software topologies with a partitioning policy, non-blocking I/O multiplexing, and message on demand. Design patterns are generic solutions to recurring software design problems, and in this paper these three techniques are abstracted into a design pattern for scalable cluster management software. Following this design pattern, several cluster management tools, such as a job scheduler and an MPI job launcher, have been designed and deployed on Dawning supercomputers. Performance evaluation results show that the good scalability of cluster management software on Dawning supercomputers benefits from this design pattern.
