Assessing the cost of redistribution followed by a computational kernel: Complexity and performance results

Algorithms for finding the optimal distribution compatible with a given data partition. Analysis of the algorithms for different cost metrics. NP-completeness proof for the redistribution problem followed by a computational kernel. Experimental results for the 1D-stencil kernel and the QR factorization algorithm.

The classical redistribution problem aims at optimally scheduling communications when reshuffling from an initial data distribution to a target data distribution. This target data distribution is usually chosen to optimize some objective for the algorithmic kernel under study (good computational balance, or low communication volume or cost), and therefore to provide high efficiency for that kernel. However, the distribution that minimizes the target objective is generally not unique. This leads to generalizing the redistribution problem as follows: find a re-mapping of data items onto processors such that the data redistribution cost is minimal and the kernel remains as efficient. This paper studies the complexity of this generalized problem. We compute optimal solutions and evaluate, through simulations, their gain over classical redistribution. We also show that it is NP-hard to find the optimal data partition and processor permutation (defined by new subsets) that minimize the cost of redistribution followed by a simple computational kernel. Finally, experimental validation of the new redistribution algorithms is conducted on a multicore cluster, for both a 1D-stencil kernel and a more compute-intensive dense linear algebra routine.
