Performance Enhancement for Matrix Multiplication on an SMP PC Cluster

Our study proposes the Reducing-size Task Assignation technique (RTA), a novel approach to the grain-size problem for the hybrid MPI-OpenMP thread-to-thread (hybrid TC) programming model when performing distributed matrix multiplication on SMP PC clusters. With RTA, hybrid TC achieves acceptable computation performance while retaining its dynamic task scheduling capability, and thereby yields a 22% performance improvement over the pure MPI model on a 16-node cluster of dual-processor Xeon SMPs. We also provide formulas to predict hybrid TC performance under different circumstances.
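The abstract does not spell out how RTA sizes its tasks, so the following is not the authors' algorithm. It is a minimal sketch of the general idea the name suggests: task sizes that shrink as the remaining work runs out, in the style of guided self-scheduling, so that early tasks are coarse (low scheduling overhead) and late tasks are fine (better load balance). The function name `reducing_chunks`, the divisor `2 * workers`, and the `min_chunk` floor are all assumptions made for illustration.

```python
from math import ceil


def reducing_chunks(total_rows, workers, min_chunk=1):
    """Split total_rows into task sizes that never grow from one task to the next.

    Hypothetical illustration only: each task takes a fixed fraction of the
    rows still unassigned, but never fewer than min_chunk rows (except for
    the final task, which takes whatever is left).
    """
    sizes = []
    remaining = total_rows
    while remaining > 0:
        # Assumed rule: hand out 1/(2 * workers) of the remaining rows.
        size = max(min_chunk, ceil(remaining / (2 * workers)))
        size = min(size, remaining)  # the last task may be smaller
        sizes.append(size)
        remaining -= size
    return sizes


# Example: 1000 rows of the result matrix shared by 4 worker threads.
sizes = reducing_chunks(total_rows=1000, workers=4, min_chunk=16)
print(len(sizes), "tasks, first:", sizes[0], "last:", sizes[-1])
```

In a hybrid MPI-OpenMP setting, each size would correspond to a block of rows handed to the next idle thread; the floor `min_chunk` stands in for the smallest grain at which per-task overhead is still acceptable.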
