Lazy tree splitting

Nested data-parallelism (NDP) is a declarative style for programming irregular parallel applications. NDP languages combine language features that support this style with efficient compilation and common built-in parallel operations such as maps, filters, and sum-like reductions. In this paper, we describe the implementation of NDP in Parallel ML (PML), part of the Manticore project. Managing the parallel decomposition of work is one of the main challenges of implementing NDP. If the decomposition creates too many small chunks of work, performance is eroded by parallel overhead; if, on the other hand, it creates too few large chunks, sequential processing dominates and processors sit idle. Recently, the technique of Lazy Binary Splitting was proposed for dynamic parallel decomposition of work on flat arrays, with promising results. We adapt Lazy Binary Splitting to the parallel processing of binary trees, which we use to represent parallel arrays in PML. We call our technique Lazy Tree Splitting (LTS). One of its main advantages is performance robustness: no per-program tuning is required to achieve good performance across varying platforms. We describe LTS-based implementations of standard NDP operations, and we present experimental data demonstrating the scalability of LTS across a range of benchmarks.
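To make the split-versus-go-sequential decision concrete, here is a minimal sketch in Python of a rope (a binary tree with data at the leaves, as used to represent parallel arrays) and an LTS-style map over it. All names (`Leaf`, `Cat`, `lts_map`, `MAX_LEAF`, `others_hungry`) are illustrative assumptions, not PML's actual API, and the model only captures the decision point: a real implementation would hand the split-off halves to a work-stealing scheduler and would re-check for hungry workers periodically during sequential processing.

```python
from dataclasses import dataclass
from typing import Callable, List, Union

MAX_LEAF = 4  # maximum elements per leaf; a real implementation tunes this


@dataclass
class Leaf:
    data: List[int]


@dataclass
class Cat:
    size: int
    left: "Rope"
    right: "Rope"


Rope = Union[Leaf, Cat]


def from_list(xs: List[int]) -> Rope:
    """Build a balanced rope from a list."""
    if len(xs) <= MAX_LEAF:
        return Leaf(xs[:])
    mid = len(xs) // 2
    return Cat(len(xs), from_list(xs[:mid]), from_list(xs[mid:]))


def to_list(r: Rope) -> List[int]:
    """Flatten a rope back to a list (left-to-right leaf order)."""
    if isinstance(r, Leaf):
        return r.data[:]
    return to_list(r.left) + to_list(r.right)


def seq_map(f: Callable[[int], int], r: Rope) -> Rope:
    """Plain sequential map, preserving rope structure."""
    if isinstance(r, Leaf):
        return Leaf([f(x) for x in r.data])
    return Cat(r.size, seq_map(f, r.left), seq_map(f, r.right))


def lts_map(f: Callable[[int], int], r: Rope,
            others_hungry: Callable[[], bool]) -> Rope:
    """LTS-style map: at each internal node, cheaply check whether any
    worker is idle ("hungry").  If so, split at this node -- in a real
    runtime the two halves could then run in parallel or be stolen.
    If not, fall back to cheap sequential traversal of the subtree."""
    if isinstance(r, Leaf):
        return Leaf([f(x) for x in r.data])
    if others_hungry():
        # Split point: halves are candidates for parallel execution.
        return Cat(r.size,
                   lts_map(f, r.left, others_hungry),
                   lts_map(f, r.right, others_hungry))
    # No idle workers observed: avoid task-creation overhead.
    return seq_map(f, r)
```

For example, `to_list(lts_map(lambda x: x * x, from_list(list(range(10))), lambda: False))` squares every element while never splitting, whereas passing `lambda: True` splits at every internal node; both return the same result, which is the robustness property LTS aims for.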
