Are Static Schedules so Bad? A Case Study on Cholesky Factorization

Our goal is to provide an analysis and comparison of static and dynamic strategies for task graph scheduling on platforms consisting of heterogeneous and unrelated resources, such as GPUs and CPUs. Static scheduling strategies, which have been used for years, suffer from several weaknesses. First, it is well known that the underlying optimization problems are NP-complete, which limits the search for optimal solutions to small cases. Second, parallelism inside processing nodes makes it difficult to precisely predict the performance of both communications and computations, due to shared resources and co-scheduling effects. Recently, to cope with these limitations, many dynamic task-graph-based runtime schedulers (StarPU, StarSs, QUARK, PaRSEC) have been proposed. Dynamic schedulers base their allocation and scheduling decisions on the one hand on dynamic information, such as the set of available tasks, the location of data and the state of the resources, and on the other hand on static information, such as task priorities computed from the whole task graph. Our analysis is in-depth, but it concentrates on a single kernel, namely Cholesky factorization of dense matrices on platforms consisting of GPUs and CPUs. This application exhibits many characteristics that are important in our context. Indeed, it consists of a phase where the number of available tasks is large, where the careful use of resources is critical, and of a phase with few available tasks, where the choice of the task to be executed is crucial. In this paper, we analyze the performance of static and dynamic strategies and we propose a set of intermediate strategies, obtained by adding more static (resp. dynamic) features into dynamic (resp. static) strategies. Our conclusions are somewhat unexpected in the sense that we prove that static-based strategies are very efficient, even in a context where performance estimations are not very good.
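
To make the notions of task graph, available tasks and static priorities concrete, the following is a minimal sketch, not the implementation studied in the paper. It builds the task graph of the standard tiled Cholesky factorization (kernels POTRF, TRSM, SYRK, GEMM) and computes two quantities: a priority for each task, taken as the length of the longest path from that task to the end of the graph, and the number of tasks at each dependency level. All function names are hypothetical, and every task is assumed to have unit cost; on a CPU/GPU platform the priorities would instead be computed from estimated execution times on each resource type.

```python
from collections import defaultdict


def cholesky_dag(n):
    """Return {task: set of predecessors} for tiled Cholesky on n x n tiles.

    Tasks are inserted in the sequential order of the algorithm, so the
    insertion order of the dictionary is a valid topological order.
    """
    deps = {}
    last_writer = {}  # tile (i, j) -> last task that wrote it

    def add(task, reads, writes):
        # A task depends on the last writer of every tile it reads or updates.
        deps[task] = {last_writer[t] for t in reads + [writes] if t in last_writer}
        last_writer[writes] = task

    for k in range(n):
        add(("POTRF", k), [], (k, k))
        for i in range(k + 1, n):
            add(("TRSM", i, k), [(k, k)], (i, k))
        for i in range(k + 1, n):
            add(("SYRK", i, k), [(i, k)], (i, i))
            for j in range(k + 1, i):
                add(("GEMM", i, j, k), [(i, k), (j, k)], (i, j))
    return deps


def bottom_levels(deps):
    """Static priority: longest path (in tasks) from each task to the exit.

    Assumption: unit task cost, for illustration only.
    """
    succs = defaultdict(set)
    for task, preds in deps.items():
        for p in preds:
            succs[p].add(task)
    bl = {}
    for task in reversed(list(deps)):  # reverse topological order
        bl[task] = 1 + max((bl[s] for s in succs[task]), default=0)
    return bl


def width_profile(deps):
    """Number of tasks at each as-soon-as-possible dependency level."""
    level = {}
    for task, preds in deps.items():  # topological order
        level[task] = 1 + max((level[p] for p in preds), default=0)
    widths = defaultdict(int)
    for lvl in level.values():
        widths[lvl] += 1
    return [widths[lvl] for lvl in sorted(widths)]


if __name__ == "__main__":
    dag = cholesky_dag(8)
    priorities = bottom_levels(dag)
    print("tasks:", len(dag))
    print("critical path (tasks):", priorities[("POTRF", 0)])
    print("tasks per level:", width_profile(dag))
```

The per-level counts give a rough view of the two regimes mentioned above: levels with many independent tasks, where resource allocation matters most, and a tail with very few tasks, where picking the task on the critical path matters most.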
