Scheduling and memory optimizations for sparse direct solver on multi-core/multi-GPU cluster systems. (Ordonnancement et optimisations mémoire pour un solveur creux par méthodes directes sur des machines hétérogènes)

The ongoing evolution of hardware brings an increase in both the number and the heterogeneity of computing resources. The pressure to maintain reasonable levels of performance and portability forces application developers to leave traditional programming paradigms and explore alternative solutions. PaStiX is a parallel sparse direct solver based on a dynamic scheduler for modern hierarchical manycore architectures. In this thesis, we study the benefits and limits of replacing the highly specialized internal scheduler of the PaStiX solver with two generic runtime systems: PaRSEC and StarPU. To do so, we describe the factorization algorithm as a task graph that we provide to the runtime system, which can then decide how to process and optimize the graph traversal in order to maximize the efficiency of the algorithm on the targeted hardware platform. We perform a comparative study of the performance of the PaStiX solver on top of its original internal scheduler and on top of the PaRSEC and StarPU frameworks. The analysis highlights that these generic task-based runtimes achieve results comparable to those of the application-optimized embedded scheduler on homogeneous platforms. Furthermore, they are able to significantly speed up the solver in heterogeneous environments by taking advantage of the accelerators while hiding the complexity of their efficient manipulation from the programmer. We also study the possibility of building a distributed sparse linear solver on top of task-based runtime systems to target heterogeneous clusters. To permit efficient and easy use of these developments in parallel simulations, we also present an optimized distributed interface that hides the complexity of constructing a distributed matrix from the user.
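The idea of handing a factorization to a runtime as a task graph can be illustrated with a minimal Python sketch. It is not the PaStiX, PaRSEC, or StarPU API: the function names `build_task_graph` and `run_tasks`, the `contributes_to` input, and the two task kinds (`factor` and `update`) are all hypothetical, and the scheduler is a plain sequential topological traversal standing in for a real runtime. Dependencies are inferred from which panel each task writes, in the spirit of a sequential task flow.

```python
from collections import defaultdict, deque


def build_task_graph(contributes_to):
    """Build a dependency map for a panel-based factorization (illustrative only).

    contributes_to[k] lists the panels j > k that panel k updates.
    Every panel must appear as a key. Returns {task: set of prerequisite tasks}.
    """
    deps = {}
    last_writer = {}  # panel index -> last task that modified that panel
    for k in sorted(contributes_to):
        factor = ("factor", k)
        deps[factor] = set()
        if k in last_writer:
            # Panel k can only be factorized after its last incoming update.
            deps[factor].add(last_writer[k])
        last_writer[k] = factor
        for j in contributes_to[k]:
            update = ("update", k, j)
            deps[update] = {factor}  # reads the factorized panel k
            if j in last_writer:
                # Updates to the same panel are serialized (read-write access).
                deps[update].add(last_writer[j])
            last_writer[j] = update
    return deps


def run_tasks(deps, execute):
    """Stand-in for a runtime: run each task once all its prerequisites ran."""
    remaining = {task: len(pre) for task, pre in deps.items()}
    successors = defaultdict(list)
    for task, pre in deps.items():
        for p in pre:
            successors[p].append(task)
    ready = deque(task for task, n in remaining.items() if n == 0)
    order = []
    while ready:
        task = ready.popleft()
        execute(task)
        order.append(task)
        for s in successors[task]:
            remaining[s] -= 1
            if remaining[s] == 0:
                ready.append(s)
    if len(order) != len(deps):
        raise ValueError("dependency cycle detected")
    return order


if __name__ == "__main__":
    # Hypothetical symbolic structure: panels 0 and 1 both contribute to panel 2.
    graph = build_task_graph({0: [2], 1: [2], 2: []})
    run_tasks(graph, lambda task: print("running", task))
```

In this sketch the application only declares tasks and the data they touch; the order of execution is left to `run_tasks`. A real runtime would additionally choose where each ready task runs (CPU core or accelerator) and move the data accordingly.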
