Low-Span Parallel Algorithms for the Binary-Forking Model

The binary-forking model is a parallel computation model, formally defined by Blelloch et al., in which a thread can fork a concurrent child thread, recursively and asynchronously. The model incurs a cost of Θ(log n) to spawn or synchronize n tasks or threads. The binary-forking model realistically captures the performance of parallel algorithms implemented using modern multithreaded programming languages on multicore shared-memory machines. In contrast, the widely studied theoretical PRAM model does not consider the cost of spawning and synchronizing threads; as a result, algorithms achieving optimal performance bounds in the PRAM model may not be optimal in the binary-forking model. Often, algorithms need to be redesigned to achieve optimal performance bounds in the binary-forking model, and the non-constant synchronization cost makes the task challenging. In this paper, we show that in the binary-forking model we can achieve optimal or near-optimal span with negligible or no asymptotic blow-up in work for comparison-based sorting, Strassen's matrix multiplication (MM), and the Fast Fourier Transform (FFT). Our major results are as follows: (1) A randomized comparison-based sorting algorithm with optimal O(log n) span and O(n log n) work, both with high probability (w.h.p.) in n. (2) An optimal O(log n) span algorithm for Strassen's MM with only a log log n-factor blow-up in work, as well as a near-optimal O(log n log log log n) span algorithm with no asymptotic blow-up in work. (3) A near-optimal O(log n log log log n) span FFT algorithm with less than a log n-factor blow-up in work for all practical values of n (i.e., n ≤ 10^10,000).
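
To illustrate where the Θ(log n) spawning cost comes from, the following Python sketch launches n leaf tasks using only two-way forks and reports the number of forks and the depth of the resulting fork tree. It is a sequential simulation under stated assumptions, not an implementation from the paper: the name `binary_fork_spawn` is hypothetical, the two recursive calls stand in for a forked child and its parent's continuation, and the range is split evenly.

```python
def binary_fork_spawn(lo, hi, task):
    """Run task(i) for every i in [lo, hi) using only two-way forks.

    Sequential simulation: the two recursive calls below model the forked
    child and the parent's continuation. Returns (forks, depth), where
    depth is the longest chain of forks before any leaf task starts.
    """
    if hi <= lo:
        return 0, 0          # empty range: nothing to spawn
    if hi - lo == 1:
        task(lo)             # a single leaf task needs no fork
        return 0, 0
    mid = (lo + hi) // 2     # balanced split keeps the fork tree shallow
    left_forks, left_depth = binary_fork_spawn(lo, mid, task)
    right_forks, right_depth = binary_fork_spawn(mid, hi, task)
    # One fork at this node; the two halves would run concurrently,
    # so the depth is one more than the deeper half.
    return left_forks + right_forks + 1, max(left_depth, right_depth) + 1


if __name__ == "__main__":
    results = []
    forks, depth = binary_fork_spawn(0, 8, results.append)
    print(forks, depth)      # 7 forks, fork-tree depth 3 for n = 8
```

With a balanced split, n tasks need n - 1 forks in total but only ⌈log2 n⌉ forks on any root-to-leaf path, which is the logarithmic term that a PRAM-style analysis leaves out.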
