Synchronization is often necessary in parallel computing, but it can create delays whenever the receiving processor is idle, waiting for the information to arrive. This is especially true for barrier, or global, synchronization, in which every processor must synchronize with every other processor. Nonetheless, barriers are the only form of synchronization explicitly supplied in MPI and OpenMP.

Many applications do not actually require global synchronization; local synchronization, in which a processor synchronizes only with those processors from which it has an incoming edge in some directed graph, is often adequate. However, the behavior of a system under local synchronization is more difficult to analyze, since processors do not all start tasks at the same time.

In this paper, we show that if the synchronization graph is a directed cycle and the task times are geometrically distributed with p = 0.5, the time it takes for a processor to complete a task, including synchronization time, approaches an exact limit of 2 + √2 as the number of processors in the cycle approaches infinity. Under global synchronization, however, the time is unbounded, increasing logarithmically with the number of processors. Similar results also apply for p ≠ 0.5.

We give a new proof of the constant upper bounds that apply when task times are normally distributed and the synchronization graph is any graph of bounded degree. We also prove that for some power-law distributions on the task times, there is no constant upper bound as the number of processors increases, even for the directed cycle. Finally, we show that constant upper bounds apply for some cases of a different synchronization model in which a processor waits for only a subset of its neighbors.
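To make the directed-cycle result concrete, here is a minimal Monte Carlo sketch (not from the paper). It assumes the standard max-plus recurrence for local synchronization: a processor begins task k once both it and its cycle predecessor have finished task k-1, with task times drawn as Geometric(p) on {1, 2, ...} (mean 1/p). The function name simulate_cycle and all parameter choices are illustrative.

```python
import random

def simulate_cycle(n_procs=200, n_tasks=20000, p=0.5, seed=0):
    """Estimate the per-task completion time, including synchronization
    delay, for local synchronization on a directed cycle.

    Assumed model: processor i starts task k once it and its cycle
    predecessor i-1 have both finished task k-1; task times are
    Geometric(p) on {1, 2, ...}.
    """
    rng = random.Random(seed)
    finish = [0] * n_procs  # finish time of each processor's latest task
    for _ in range(n_tasks):
        prev = finish[:]  # finish times from the previous task round
        for i in range(n_procs):
            # Wait for self and cycle predecessor (finish[-1] wraps around).
            ready = max(prev[i], prev[i - 1])
            # Geometric(p): number of Bernoulli(p) trials up to first success.
            task = 1
            while rng.random() >= p:
                task += 1
            finish[i] = ready + task
    # Long-run average time per task for one processor.
    return finish[0] / n_tasks

if __name__ == "__main__":
    est = simulate_cycle()
    print(f"estimated time per task: {est:.3f}  (claimed limit: 2 + sqrt(2) ~= 3.414)")
```

With p = 0.5 and a few hundred processors, the estimate should land near 3.41, consistent with the 2 + √2 limit claimed above; a bare geometric task with mean 2 shows that roughly √2 of the per-task time is synchronization overhead in this model.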