Extending the Nested Parallel Model to the Nested Dataflow Model with Provably Efficient Schedulers

The nested-parallel (a.k.a. fork-join) model is widely used for writing parallel programs. However, the two composition constructs that comprise it, "||" (parallel) and ";" (serial), are insufficient for expressing "partial dependencies" in a program. We propose a new dataflow composition construct, "↝", that expresses partial dependencies between subcomputations in a processor- and cache-oblivious way, thus extending the Nested Parallel (NP) model to the Nested Dataflow (ND) model. We redesign several divide-and-conquer algorithms, ranging from dense linear algebra to dynamic programming, in the ND model and prove that they all have optimal span while retaining optimal cache complexity. We propose the design of runtime schedulers that map ND programs to multicore processors with multiple levels of possibly shared caches (i.e., Parallel Memory Hierarchies) and prove guarantees on their ability to balance computation across processors and preserve locality. To this end, we adapt space-bounded (SB) schedulers to the ND model. We show that our algorithms have increased "parallelizability" in the ND model, and that SB schedulers can exploit this extra parallelizability to achieve asymptotically optimal bounds on cache misses and running time on a greater number of processors than in the NP model. The running time of the algorithms in this paper on a $p$-processor machine is $O\!\left(\sum_{i=0}^{h-1} Q^{*}(t;\,\sigma \cdot M_i)\cdot C_i \,/\, p\right)$, where $Q^{*}$ is the parallel cache complexity of task $t$, $C_i$ is the cost of a miss at the level-$i$ cache of size $M_i$, $h$ is the number of cache levels, and $\sigma\in(0,1)$ is a constant.
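
To make the contrast concrete, here is a minimal sketch in C++ that models a dataflow edge with a future; the task names (produce_left, produce_right, consume_left) and the use of std::async are illustrative placeholders, not the paper's actual "↝" construct or runtime. In the NP composition "(A1 || A2); B", the ";" is a full join, so B waits for both subtasks even if it reads only A1's result; a dataflow edge "A1 ↝ B" releases B as soon as A1 alone finishes.

    // A minimal sketch (assumed names, not the paper's syntax) contrasting the
    // NP composition "(A1 || A2); B" with a dataflow edge "A1 ↝ B" via futures.
    #include <future>
    #include <iostream>
    #include <numeric>
    #include <vector>

    // Hypothetical subtasks: A1 and A2 run in parallel; B reads only A1's output.
    std::vector<int> produce_left()  { return {1, 2, 3}; }
    std::vector<int> produce_right() { return {4, 5, 6}; }
    int consume_left(const std::vector<int>& l) {
        return std::accumulate(l.begin(), l.end(), 0);
    }

    int np_style() {
        // NP model: "(A1 || A2); B". The ";" forces B to wait for A2 as well,
        // even though B never reads A2's result: the partial dependency is lost.
        auto a1 = std::async(std::launch::async, produce_left);
        auto a2 = std::async(std::launch::async, produce_right);
        auto l = a1.get();
        a2.get();                        // full join before B may start
        return consume_left(l);
    }

    int nd_style() {
        // ND model: the edge "A1 ↝ B" releases B once A1 alone is done,
        // so B overlaps with A2 and the critical path (span) shrinks.
        auto a2 = std::async(std::launch::async, produce_right);
        auto a1 = std::async(std::launch::async, produce_left);
        int r = consume_left(a1.get()); // wait only on the ↝ edge
        a2.get();                        // A2 joins later, off B's critical path
        return r;
    }

    int main() {
        std::cout << np_style() << " " << nd_style() << "\n";  // prints "6 6"
    }

Both variants compute the same value; the point of the sketch is only that nd_style removes A2 from B's critical path, which is the scheduling freedom the "↝" construct exposes obliviously inside recursive algorithms.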
