Optimal Parallel Algorithms in the Binary-Forking Model

In this paper we develop optimal algorithms in the binary-forking model for a variety of fundamental problems, including sorting, semisorting, list ranking, tree contraction, range minima, and ordered set union, intersection, and difference. In the binary-forking model, a task can fork into only two child tasks, but can do so recursively and asynchronously. The tasks share memory, which supports reads, writes, and test-and-sets. Costs are measured in terms of work (total number of instructions) and span (longest dependence chain). The binary-forking model is meant to capture both algorithm performance and algorithm-design considerations on many existing multithreaded languages, which are also asynchronous and rely on binary forks either explicitly or under the covers. In contrast to the widely studied PRAM model, it assumes neither arbitrary-way forks nor synchronous operations, both of which are hard to implement on modern hardware. While optimal PRAM algorithms are known for the problems studied herein, it turns out that arbitrary-way forking and strict synchronization are powerful, if unrealistic, capabilities. Natural simulations of these PRAM algorithms in the binary-forking model (i.e., implementations in existing parallel languages) incur an Ω(log n) overhead in span. This paper explores techniques for designing optimal algorithms when limited to binary forking and assuming asynchrony. All algorithms described in this paper are the first with optimal work and span in the binary-forking model. Most of the algorithms are simple, and many are randomized.
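To make the cost measures concrete, the following is a minimal sketch of work and span accounting when n independent tasks are launched using only binary forks. It is an illustration, not code from the paper: the names `Cost`, `fork2`, and `parallel_for` are hypothetical, and the sketch assumes that each fork and each leaf task costs one unit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cost:
    work: int  # total number of instructions executed
    span: int  # length of the longest dependence chain


def fork2(left: Cost, right: Cost) -> Cost:
    """Cost of one binary fork whose two children cost `left` and `right`.

    Assumption of this sketch: the fork itself is charged one unit.
    Work adds up across both children; span follows the slower child.
    """
    return Cost(left.work + right.work + 1, max(left.span, right.span) + 1)


def parallel_for(n: int, leaf: Cost = Cost(1, 1)) -> Cost:
    """Cost of running n >= 1 independent leaf tasks via recursive binary forks."""
    if n == 1:
        return leaf
    half = n // 2
    return fork2(parallel_for(half, leaf), parallel_for(n - half, leaf))


if __name__ == "__main__":
    for n in (1, 8, 1024):
        print(n, parallel_for(n))
```

Under these assumptions the work stays linear (2n - 1 units), while the span grows as ⌈log₂ n⌉ + 1 even though every leaf takes a single step. This is the kind of logarithmic span cost that arises when an arbitrary-way fork, which a PRAM performs in one step, is replaced by a tree of binary forks.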
