Dynamic Scheduling of Irregular Stream Programs toward Many-Core Scalability

The stream programming model has received much interest because it naturally exposes task, data, and pipeline parallelism. However, most priorwork has focused on the static scheduling of regular stream programs. Therefore, irregular applications cannot be handled in static scheduling, and the load imbalance caused by static scheduling faces scalability limitations in many-core systems. In this paper, we introduce the DANBI programming model, which supports irregular stream programs, and propose dynamic scheduling techniques. Scheduling irregular stream programs is very challenging, and the load imbalance becomes a major hurdle to achieving scalability. Our dynamic load-balancing scheduler exploits producer-consumer relationships already expressed in the DANBI program to achieve scalability. Moreover, it effectively avoids the thundering-herd problem and dynamically adapts to load imbalance in a probabilistic manner. It surpasses prior static stream scheduling approaches which are vulnerable to load imbalance and also surpasses prior dynamic stream scheduling approaches which result in many restrictions on supported program types, on the scope of dynamic scheduling, and on data ordering preservation. Our experimental results on a 40-core server show that DANBI achieves an almost linear scalability and outperforms state-of-the-art parallel runtimes by up to 2.8 times.

[1]  Thomas R. Gross,et al.  Memory management in NUMA multicore systems: trapped between cache contention and interconnect overhead , 2011, ISMM '11.

[2]  Selim G. Akl,et al.  Optimal Parallel Merging and Sorting Without Memory Conflicts , 1987, IEEE Transactions on Computers.

[3]  Luca P. Carloni,et al.  Flexible filters: load balancing through backpressure for stream programs , 2009, EMSOFT '09.

[4]  Ying Xing,et al.  The Design of the Borealis Stream Processing Engine , 2005, CIDR.

[5]  Michael L. Scott,et al.  Algorithms for scalable synchronization on shared-memory multiprocessors , 1991, TOCS.

[6]  Sriram Krishnamoorthy,et al.  Solving Large, Irregular Graph Problems Using Adaptive Work-Stealing , 2008, 2008 37th International Conference on Parallel Processing.

[7]  Keshav Pingali,et al.  The tao of parallelism in algorithms , 2011, PLDI '11.

[8]  William Thies,et al.  An empirical characterization of stream programs and its implications for language and compiler design , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[9]  Scott A. Mahlke,et al.  Sponge: portable stream programming on graphics engines , 2011, ASPLOS XVI.

[10]  David E. Culler,et al.  SEDA: an architecture for well-conditioned, scalable internet services , 2001, SOSP.

[11]  Leonardo Neumeyer,et al.  S4: Distributed Stream Computing Platform , 2010, 2010 IEEE International Conference on Data Mining Workshops.

[12]  E.A. Lee,et al.  Synchronous data flow , 1987, Proceedings of the IEEE.

[13]  Alexandra Fedorova,et al.  A case for NUMA-aware contention management on multicore systems , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[14]  Albert Cohen,et al.  OpenStream: Expressiveness and data-flow compilation of OpenMP streaming programs , 2012, TACO.

[15]  Kun-Lung Wu,et al.  Elastic scaling of data parallel operators in stream processing , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.

[16]  Jong-Deok Choi,et al.  An OpenCL framework for heterogeneous multicores with local memory , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[17]  William J. Dally,et al.  Buffer-space efficient and deadlock-free scheduling of stream applications on multi-core architectures , 2010, SPAA '10.

[18]  Arun Raman,et al.  Parallelism orchestration using DoPE: the degree of parallelism executive , 2011, PLDI '11.

[19]  Kevin Skadron,et al.  Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[20]  Anjul Patney,et al.  Task management for irregular-parallel workloads on the GPU , 2010, HPG '10.

[21]  Yale N. Patt,et al.  Feedback-directed pipeline parallelism , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[22]  Vivek Sarkar,et al.  X10: an object-oriented approach to non-uniform cluster computing , 2005, OOPSLA '05.

[23]  Scott A. Mahlke,et al.  Orchestrating the execution of stream programs on multicore platforms , 2008, PLDI '08.

[24]  Matteo Frigo,et al.  The implementation of the Cilk-5 multithreaded language , 1998, PLDI.

[25]  Michael I. Gordon Compiler techniques for scalable performance of stream programs on multicore architectures , 2010 .

[26]  Leslie G. Valiant,et al.  A bridging model for parallel computation , 1990, CACM.

[27]  Robert Tappan Morris,et al.  An Analysis of Linux Scalability to Many Cores , 2010, OSDI.

[28]  Dirk Grunwald,et al.  Generating, optimizing, and scheduling a compiler level representation of stream parallelism , 2011 .

[29]  Navendu Jain,et al.  Adaptive Control of Extreme-scale Stream Processing Systems , 2006, 26th IEEE International Conference on Distributed Computing Systems (ICDCS'06).

[30]  Robert Morris,et al.  Non-scalable locks are dangerous , 2012 .

[31]  Pat Hanrahan,et al.  GRAMPS: A programming model for graphics pipelines , 2009, ACM Trans. Graph..

[32]  Shreekant S. Thakkar,et al.  Synchronization algorithms for shared-memory multiprocessors , 1990, Computer.

[33]  Christoforos E. Kozyrakis,et al.  Dynamic Fine-Grain Scheduling of Pipeline Parallelism , 2011, 2011 International Conference on Parallel Architectures and Compilation Techniques.

[34]  Thomas E. Anderson,et al.  The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors , 1990, IEEE Trans. Parallel Distributed Syst..