A Fast-Start, Fault-Tolerant MPI Launcher on Dawning Supercomputers

Daemon-based MPI launchers are now mainstream because they can start processes rapidly. However, effective task management and fault tolerance become more important as supercomputers grow in scale. A new fast-start, fault-tolerant launcher called SFLauncher has been used to start MPICH tasks on Dawning supercomputers. This paper details its features and implementation, with emphasis on scalability, its self-organization algorithm, and garbage reclamation. Results of a performance evaluation of SFLauncher are also given.