Development of Naturally Fault Tolerant Algorithms for Computing on 100,000 Processors

This paper describes ongoing research at Oak Ridge National Laboratory into the issues and potential problems of scaling algorithms to 100,000-processor systems. Such massively parallel computers are projected to be needed to reach petaflops computational speeds before 2010. To make such machines a reality, IBM Research has begun developing a computer named “BlueGene” that could have up to 65,536 processor chips in the 2005 time frame. A key issue is how to use a machine with 100,000 processors effectively; scientific algorithms have shown poor scalability even on the 10,000-processor systems that exist today. In this paper we define a new class of algorithms, called super-scalable algorithms, that have the property of natural fault tolerance, and we show that such algorithms exist for scientific applications. Finally, we describe a 100,000-processor simulator we have developed to test these new algorithms.
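The abstract does not spell out the algorithms themselves, but the central idea of natural fault tolerance can be illustrated with a minimal sketch: a local-averaging iteration in which each simulated process exchanges values only with its immediate neighbors and simply skips contributions from failed neighbors, so the computation degrades gracefully instead of aborting. This is an assumed illustration, not the paper's implementation; the grid size, failure rate, and iteration count below are hypothetical parameters chosen for the example.

```python
# Minimal sketch of a naturally fault tolerant, neighbor-only iteration.
# All names and parameters are illustrative assumptions, not taken from the paper.
import random

GRID = 32            # simulate a GRID x GRID array of processes
ITERATIONS = 200     # number of relaxation sweeps
FAILURE_RATE = 0.01  # assumed fraction of processes lost

# Each "process" holds one value; some processes are marked as failed up front.
values = [[random.random() for _ in range(GRID)] for _ in range(GRID)]
alive = [[random.random() > FAILURE_RATE for _ in range(GRID)] for _ in range(GRID)]

def neighbors(i, j):
    """Yield the coordinates of the four nearest neighbors inside the grid."""
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        ni, nj = i + di, j + dj
        if 0 <= ni < GRID and 0 <= nj < GRID:
            yield ni, nj

for _ in range(ITERATIONS):
    new = [row[:] for row in values]
    for i in range(GRID):
        for j in range(GRID):
            if not alive[i][j]:
                continue  # a failed process contributes nothing
            # Average over live neighbors only; missing data is simply ignored,
            # so no global recovery or checkpoint restart is needed.
            vals = [values[ni][nj] for ni, nj in neighbors(i, j) if alive[ni][nj]]
            if vals:
                new[i][j] = sum(vals) / len(vals)
    values = new

live = [values[i][j] for i in range(GRID) for j in range(GRID) if alive[i][j]]
print("surviving processes:", len(live), "mean value:", sum(live) / len(live))
```

Because each process depends only on whatever neighbor data happens to be available, the loss of a small fraction of processes perturbs the result slightly rather than invalidating it, which is the property the paper refers to as natural fault tolerance.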
