Achieving Robustness and Minimizing Overhead in Parallel Algorithms Through Overlapped Communication/Computation

One of the major goals in the design of parallel processing machines and algorithms is to achieve robustness and to reduce the overhead introduced when a problem is parallelized or a fault occurs. A key contributor to overhead is communication time, particularly when a node is faulty and another node is substituting for its operation. Many architectures attack this overhead by minimizing the raw cost of each communication, improving latency and bandwidth figures. Another approach is to hide communication by overlapping it with computation, on the assumption that computation is the dominant factor. This paper presents the mechanisms provided in the Proteus parallel computer and their effective use for communication hiding through overlapped communication/computation, both with and without the presence of a fault. These techniques extend readily to compiler support for parallel programming. We also address the complexity (or rather simplicity) of achieving complete exchange on the Proteus Machine.
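The general overlap pattern the abstract describes can be sketched as follows. This is a minimal, hypothetical illustration using Python threads (not the Proteus mechanism itself): a send of the previous step's data proceeds in the background while the current step's local computation runs, and the two are synchronized before the next step begins. The names `overlapped_step` and `send` are assumptions for illustration only.

```python
import threading

def overlapped_step(block, send):
    """Compute on `block` while previously produced data is communicated.

    `send` is any callable that performs the communication (here it is
    simulated); starting it on a separate thread lets the local computation
    below hide the communication time.
    """
    t = threading.Thread(target=send)   # communication proceeds concurrently
    t.start()
    result = sum(x * x for x in block)  # local computation overlaps the send
    t.join()                            # synchronize before the next step
    return result
```

In a real message-passing setting the same idea is usually expressed with non-blocking sends and receives (initiate, compute, then wait for completion) rather than explicit threads.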
