PackStealLB: A scalable distributed load balancer based on work stealing and workload discretization

The scalability of high-performance, parallel iterative applications is directly affected by how well they use the available computing resources. These applications are subject to load imbalance due to the nature and dynamics of their computations, and high-performance systems commonly employ periodic load balancing to tackle this issue. Dynamic load balancing algorithms redistribute the application's workload using heuristics to circumvent the NP-hard complexity of the problem. However, scheduling heuristics must be fast to avoid hindering application performance when distributing the workload on large, distributed environments. In this work, we present a technique for low-overhead, high-quality scheduling decisions for parallel iterative applications. The technique relies on combined application workload information paired with distributed scheduling algorithms. An initial distributed step among scheduling agents groups application tasks into packs of similar load to minimize the number of messages exchanged among the agents. This information is used by our scheduling algorithm, PackStealLB, for its distributed-memory work stealing heuristic. Experimental results show that PackStealLB improves the performance of a molecular dynamics benchmark by up to 41%, outperforming other scheduling algorithms in most scenarios on almost one thousand cores.
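
The two ideas named above, workload discretization into packs and work stealing at pack granularity, can be illustrated with a minimal sequential sketch. This is not the PackStealLB implementation: the function names (`make_packs`, `steal_packs`), the greedy packing rule, the fixed `pack_load` target, and the assumption that the average load is already known are all hypothetical choices made for illustration only.

```python
from typing import List, Tuple


def make_packs(task_loads: List[float], pack_load: float) -> List[List[int]]:
    """Greedily group task indices into packs of roughly `pack_load` total load.

    Assumption: heavier tasks are visited first and a pack is closed as soon
    as its load reaches the target. The paper's actual rule may differ.
    """
    packs: List[List[int]] = []
    current: List[int] = []
    current_load = 0.0
    # sorted() is stable, so tasks with equal load keep their original order.
    for idx in sorted(range(len(task_loads)), key=lambda i: -task_loads[i]):
        current.append(idx)
        current_load += task_loads[idx]
        if current_load >= pack_load:
            packs.append(current)
            current, current_load = [], 0.0
    if current:
        packs.append(current)  # leftover tasks form a final, lighter pack
    return packs


def steal_packs(
    victim_packs: List[List[int]],
    task_loads: List[float],
    thief_load: float,
    average_load: float,
) -> Tuple[List[List[int]], float]:
    """Move whole packs from a victim to a thief while the thief stays at or
    below the (assumed known) average load. Mutates `victim_packs`."""
    stolen: List[List[int]] = []
    while victim_packs:
        load = sum(task_loads[i] for i in victim_packs[-1])
        if thief_load + load > average_load:
            break
        stolen.append(victim_packs.pop())
        thief_load += load
    return stolen, thief_load


# Hypothetical loads for six tasks on one overloaded processing element.
loads = [4.0, 3.0, 2.0, 1.0, 1.0, 1.0]
packs = make_packs(loads, pack_load=4.0)
stolen, new_thief_load = steal_packs(packs, loads, thief_load=0.0, average_load=6.0)
```

Transferring whole packs rather than single tasks means one steal request can move several tasks at once, which is the message-saving effect the abstract attributes to packing. In a distributed setting the victim and thief would exchange messages instead of sharing a list.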
