Paper information - Low-Span Parallel Algorithms for the Binary-Forking Model

The binary-forking model is a parallel computation model, formally defined by Blelloch et al., in which a thread can fork a concurrent child thread, recursively and asynchronously. The model incurs a cost of Θ(log n) to spawn or synchronize n tasks or threads. The binary-forking model realistically captures the performance of parallel algorithms implemented using modern multithreaded programming languages on multicore shared-memory machines. In contrast, the widely studied theoretical PRAM model does not consider the cost of spawning and synchronizing threads, and as a result, algorithms achieving optimal performance bounds in the PRAM model may not be optimal in the binary-forking model. Often, algorithms need to be redesigned to achieve optimal performance bounds in the binary-forking model and the non-constant synchronization cost makes the task challenging. In this paper, we show that in the binary-forking model we can achieve optimal or near-optimal span with negligible or no asymptotic blowup in work for comparison-based sorting, Strassen's matrix multiplication (MM), and the Fast Fourier Transform (FFT). Our major results are as follows: (1) A randomized comparison-based sorting algorithm with optimal O(log n) span and O(n log n) work, both w.h.p. in n. (2) An optimal O(log n) span algorithm for Strassen's matrix multiplication (MM) with only a log log n-factor blow-up in work as well as a near-optimal O(log n log log log n) span algorithm with no asymptotic blow-up in work. (3) A near-optimal O(log n log log log n) span Fast Fourier Transform (FFT) algorithm with less than a log n-factor blow-up in work for all practical values of n (i.e., n ≤ 10^10,000).
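To make the Θ(log n) spawning cost concrete, the following is a minimal sketch, not taken from the paper, that simulates launching n leaf tasks when each thread may fork only one child at a time. The function names `fork_span` and `fork_count` are hypothetical, chosen for illustration. The sketch assumes a balanced split at every fork and counts only fork steps, ignoring the work done by the tasks themselves.

```python
import math


def fork_span(n: int) -> int:
    """Length of the longest chain of forks needed to launch n leaf tasks.

    Assumption: each fork splits the remaining tasks as evenly as possible,
    and the two halves then proceed in parallel.
    """
    if n <= 1:
        return 0  # a single task needs no further forking
    left = n // 2
    right = n - left
    # One fork step, then the span is that of the deeper of the two subtrees.
    return 1 + max(fork_span(left), fork_span(right))


def fork_count(n: int) -> int:
    """Total number of fork operations performed (the spawning work)."""
    if n <= 1:
        return 0
    left = n // 2
    return 1 + fork_count(left) + fork_count(n - left)


if __name__ == "__main__":
    for n in (2, 8, 1000, 4096):
        # Compare the simulated span with ceil(log2 n).
        print(n, fork_span(n), math.ceil(math.log2(n)), fork_count(n))
```

Under these assumptions the simulated span equals ⌈log2 n⌉ while the total number of forks is n - 1, which is why an algorithm that takes a constant number of steps to start n processors in the PRAM model pays a logarithmic factor in span here.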
