GLB: Lifeline-based Global Load Balancing Library in X10

We present GLB, a programming model and an associated implementation that can handle a wide range of irregular parallel programming problems running over large-scale distributed systems. GLB is applicable both to problems that are easily load-balanced via static scheduling and to problems that are hard to load-balance statically. GLB hides the intricate synchronization involved (e.g., inter-node communication, initialization and startup, load balancing, termination and result collection) from users. Internally, GLB uses a version of the lifeline-graph-based work-stealing algorithm proposed by Saraswat et al. [25]. Users of GLB only need to write several pieces of sequential code that comply with the GLB interface. GLB then schedules and orchestrates the parallel execution of that code correctly and efficiently at scale. We have applied GLB to two representative benchmarks: Betweenness Centrality (BC) and Unbalanced Tree Search (UTS). Of the two, BC can be statically load-balanced whereas UTS cannot. In both cases, GLB scales well, achieving nearly linear speedup on up to 16K cores across different computer architectures (Power, Blue Gene/Q, and K).
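To make the division of labor concrete, the following is a minimal single-process sketch in Python, not the X10 library itself. It assumes a user-supplied task queue with three sequential methods, here named `process`, `split`, and `merge`; these names, the toy tree-counting workload, the `LIMIT` constant, and the `run` driver are all illustrative assumptions, not the GLB API. The driver simulates workers in rounds, lets idle workers try a few random steals, and then parks them on a single lifeline neighbor arranged in a ring, which is a simplification of the lifeline graph.

```python
import random
from collections import deque

LIMIT = 2000  # hypothetical problem size: tree nodes are the integers below LIMIT


class TreeCountQueue:
    """Hypothetical user code: purely sequential, counts nodes of an irregular tree."""

    def __init__(self):
        self.tasks = deque()  # pending tree nodes
        self.count = 0        # local partial result

    def process(self, n):
        """Expand up to n pending nodes; return True if local work remains."""
        for _ in range(n):
            if not self.tasks:
                break
            node = self.tasks.pop()
            self.count += 1
            if node % 5 != 4:  # irregular: some nodes are leaves
                for child in (2 * node + 1, 2 * node + 2):
                    if child < LIMIT:
                        self.tasks.append(child)
        return bool(self.tasks)

    def split(self):
        """Give away half of the pending tasks, or None if there are too few."""
        if len(self.tasks) < 2:
            return None
        return [self.tasks.popleft() for _ in range(len(self.tasks) // 2)]

    def merge(self, loot):
        """Accept tasks received from another worker."""
        self.tasks.extend(loot)


def run(num_workers=4, n=8, w=2, seed=0):
    """Round-based simulation of lifeline-style work stealing (assumed driver)."""
    rng = random.Random(seed)
    workers = [TreeCountQueue() for _ in range(num_workers)]
    workers[0].tasks.append(0)  # all work starts at worker 0
    # waiting[v] holds idle workers parked on v's lifeline
    waiting = [set() for _ in range(num_workers)]
    while any(q.tasks for q in workers):
        for i, q in enumerate(workers):
            if q.tasks:
                q.process(n)
                # After a batch, push work to thieves parked on this lifeline.
                for thief in sorted(waiting[i]):
                    loot = q.split()
                    if loot is None:
                        break
                    workers[thief].merge(loot)
                    waiting[i].discard(thief)
            else:
                buddy = (i + 1) % num_workers  # ring lifeline (simplification)
                if i in waiting[buddy]:
                    continue  # dormant until the lifeline delivers work
                for _ in range(w):  # w random steal attempts
                    victim = rng.randrange(num_workers)
                    loot = workers[victim].split() if victim != i else None
                    if loot:
                        q.merge(loot)
                        break
                else:
                    waiting[buddy].add(i)  # all attempts failed: park
    # Result collection: reduce the per-worker partial results.
    return sum(q.count for q in workers)
```

Calling `run(num_workers=1)` and `run(num_workers=8)` should return the same node count, since each task is processed exactly once regardless of how it migrates between workers. The user class contains no synchronization at all, which is the property the abstract attributes to GLB; everything in `run` stands in for what the library would do across nodes.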

[1]  David Cunningham,et al.  X10 and APGAS at Petascale , 2016, ACM Trans. Parallel Comput..

[2]  Robert D. Blumofe,et al.  Adaptive and Reliable Parallel Computing on Networks of Workstations , 1997 .

[3]  Stephen L. Olivier,et al.  Scalable Dynamic Load Balancing Using UPC , 2008, 2008 37th International Conference on Parallel Processing.

[4]  Vipin Kumar,et al.  Scalable Load Balancing Techniques for Parallel Computers , 1994, J. Parallel Distributed Comput..

[5]  Doug Lea,et al.  A Java fork/join framework , 2000, JAVA '00.

[6]  Sriram Krishnamoorthy,et al.  Lifeline-based global load balancing , 2011, PPoPP '11.

[7]  Yi Guo,et al.  Work-first and help-first scheduling policies for async-finish task parallelism , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.

[8]  Henri E. Bal,et al.  Efficient load balancing for wide-area divide-and-conquer applications , 2001, PPoPP '01.

[9]  James Reinders,et al.  Intel® threading building blocks , 2008 .

[10]  W. Marsden I and J , 2012 .

[11]  Olivier Tardieu,et al.  A work-stealing scheduler for X10's task parallelism with suspension , 2012, PPoPP '12.

[12]  Sriram Krishnamoorthy,et al.  Solving Large, Irregular Graph Problems Using Adaptive Work-Stealing , 2008, 2008 37th International Conference on Parallel Processing.

[13]  Laxmikant V. Kalé,et al.  A load balancing strategy for prioritized execution of tasks , 1993, [1993] Proceedings Seventh International Parallel Processing Symposium.

[14]  Laxmikant V. Kalé,et al.  CHARM++: a portable concurrent object oriented system based on C++ , 1993, OOPSLA '93.

[15]  Matteo Frigo,et al.  The implementation of the Cilk-5 multithreaded language , 1998, PLDI.

[16]  Sreedhar B. Kodali,et al.  The Asynchronous Partitioned Global Address Space Model , 2010 .

[17]  Eric A. Brewer,et al.  ATLAS: an infrastructure for global computing , 1996, EW 7.

[18]  C. Tseng,et al.  UPC Implementation of an Unbalanced Tree Search Benchmark , 2003 .

[19]  Vipin Kumar,et al.  State of the Art in Parallel Search Techniques for Discrete Optimization Problems , 1999, IEEE Trans. Knowl. Data Eng..

[20]  John M. Mellor-Crummey,et al.  Managing Asynchronous Operations in Coarray Fortran 2.0 , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.

[21]  Magdalena Balazinska,et al.  Estimating the progress of MapReduce pipelines , 2010, 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010).

[22]  Sriram Krishnamoorthy,et al.  Scalable work stealing , 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.

[23]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[24]  Magdalena Balazinska,et al.  SkewTune: mitigating skew in mapreduce applications , 2012, SIGMOD Conference.

[25]  Radha Jagadeesan,et al.  Concurrent Clustered Programming , 2005, CONCUR.