MPIPP: an automatic profile-guided parallel process placement toolset for SMP clusters and multiclusters

SMP clusters and multiclusters are widely used to execute message-passing parallel applications. How parallel processes are mapped to processors (or cores) can significantly affect application performance, because communication costs in such systems are non-uniform, so a tool that performs this mapping automatically is desirable. Although there have been various efforts to address this issue, existing solutions either require intensive user intervention or cannot handle multiclusters well. In this paper, we propose a profile-guided approach that automatically finds an optimized mapping to minimize the cost of point-to-point communications for arbitrary message-passing applications. The implemented toolset, called MPIPP (MPI Process Placement toolset), includes several components: (1) a tool to obtain the communication profile of MPI applications; (2) a tool to obtain the network topology of target clusters; and (3) an algorithm to find an optimized mapping, which is notably more effective than existing graph partitioning algorithms for multiclusters. We evaluated the performance of our toolset with the NPB benchmarks and three other applications on several clusters. Experimental results show that the optimized process placement generated by our tools can achieve significant speedup.
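The components map onto the optimization problem as follows. The profiling step records, for each pair of ranks, how much point-to-point traffic flows between them. As a rough illustration (not MPIPP's actual profiler), a wrapper over the standard PMPI interface can accumulate per-destination byte counts; the MAX_RANKS bound and the bytes_sent array here are assumptions made for the sketch:

```c
/* Minimal sketch of communication profiling via the PMPI interface.
 * Linking this wrapper ahead of the MPI library intercepts MPI_Send
 * and accumulates the bytes this rank sends to each destination. */
#include <mpi.h>

#define MAX_RANKS 1024                  /* assumed upper bound on ranks */
static long long bytes_sent[MAX_RANKS]; /* traffic from this rank to dest */

int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    int size;
    MPI_Type_size(datatype, &size);     /* bytes per element */
    bytes_sent[dest] += (long long)count * size;
    return PMPI_Send(buf, count, datatype, dest, tag, comm);
}
```

A complete profiler would also wrap the other point-to-point calls (e.g. MPI_Isend, MPI_Sendrecv) and dump the counts at MPI_Finalize. Given the resulting traffic matrix and a network cost matrix (obtainable, for instance, from pairwise ping-pong benchmarks), the placement problem is to assign processes to nodes so that heavily communicating pairs land on cheap links. The sketch below is a hypothetical pairwise-swap local search standing in for the paper's algorithm; the comm and net matrices and the function names are illustrative only:

```c
/* Hypothetical local search for process placement: minimize the sum of
 * comm[i][j] * net[place[i]][place[j]] over all process pairs (i, j). */
#include <stdlib.h>

static double mapping_cost(int n, const double *comm, const double *net,
                           const int *place)
{
    double cost = 0.0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            if (i != j)
                cost += comm[i * n + j] * net[place[i] * n + place[j]];
    return cost;
}

void optimize_placement(int n, const double *comm, const double *net,
                        int *place, int iters)
{
    for (int i = 0; i < n; i++)
        place[i] = i;                   /* start from the identity mapping */
    double best = mapping_cost(n, comm, net, place);
    for (int it = 0; it < iters; it++) {
        int i = rand() % n, j = rand() % n;
        if (i == j) continue;
        int t = place[i]; place[i] = place[j]; place[j] = t;
        double cost = mapping_cost(n, comm, net, place);
        if (cost < best) {
            best = cost;                /* keep the improving swap */
        } else {
            t = place[i]; place[i] = place[j]; place[j] = t; /* undo */
        }
    }
}
```

Greedy swap search like this can stall in local minima, which is why heuristics such as graph partitioning or simulated-annealing-style moves are typically used for the same problem at scale, particularly in the multicluster case.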
