Bounded memory scheduling of dynamic task graphs

It is now widely recognized that increased levels of parallelism are a necessary condition for improved application performance on multicore computers. However, as the number of cores increases, the memory-per-core ratio is expected to decrease further, making the per-core memory efficiency of parallel programs an even more important concern in future systems. For many parallel applications, memory requirements can be significantly larger than those of their sequential counterparts and, more importantly, their memory utilization depends critically on the schedule used to run them. To address this problem, we propose bounded memory scheduling (BMS) for parallel programs expressed as dynamic task graphs, in which an upper bound is imposed on the program's peak memory. Using the inspector/executor model, BMS restricts the set of allowable schedules so that either the program is guaranteed to execute within the given memory bound, or an error is reported during the inspector phase, without running the computation, when no feasible schedule exists. Since solving BMS is NP-hard, we propose an approach in which we first apply a heuristic algorithm and, if it fails, fall back on a more expensive optimal approach that is sped up by the best-effort result of the heuristic. Through evaluation on seven benchmarks, we show that BMS gracefully spans the spectrum between fully parallel and serial execution as the memory bound decreases. Comparison with OpenMP shows that BMS-CnC can execute in 53% of the memory required by OpenMP while achieving 90% (or more) of OpenMP's performance.
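To illustrate the kind of problem BMS solves, the following is a minimal sketch (not the paper's actual algorithm) of a greedy heuristic scheduler over a task graph. The task-graph representation and the `bms_greedy` helper are hypothetical: each task names its dependencies and the memory it allocates; a task may start only if its allocation fits under the bound, and a task's output is freed once all of its consumers have completed. Like the paper's heuristic phase, it can fail even when a feasible schedule exists, in which case a more expensive optimal search would take over.

```python
def bms_greedy(tasks, mem_bound):
    """Greedy bounded-memory list scheduling over a static task graph.

    tasks: dict mapping task name -> (list of dependency names, memory
    allocated by the task). Returns a serial schedule whose peak memory
    stays within mem_bound, or None if the heuristic gets stuck.
    """
    # Count, for each task, how many other tasks consume its output;
    # a task's memory is freed once this count drops to zero.
    remaining_consumers = {t: 0 for t in tasks}
    for deps, _ in tasks.values():
        for d in deps:
            remaining_consumers[d] += 1

    done, schedule, in_use = set(), [], 0
    while len(done) < len(tasks):
        progress = False
        for t, (deps, mem) in tasks.items():
            if t in done or not all(d in done for d in deps):
                continue  # already run, or dependencies not yet satisfied
            if in_use + mem <= mem_bound:
                in_use += mem
                schedule.append(t)
                done.add(t)
                # Free inputs whose consumers have all completed.
                for d in deps:
                    remaining_consumers[d] -= 1
                    if remaining_consumers[d] == 0:
                        in_use -= tasks[d][1]
                progress = True
        if not progress:
            return None  # no runnable task fits: heuristic fails
    return schedule
```

For a diamond-shaped graph where `a` feeds `b` and `c`, which both feed `d` (each allocating 2, 2, 2, and 1 units respectively), a bound of 6 admits the serial schedule `a, b, c, d`, while a bound of 5 is infeasible because `a`, `b`, and `c` must all be live when `c` runs.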
