A NUMA Aware Scheduler for a Parallel Sparse Direct Solver

Modern multiprocessor architectures are commonly based on shared-memory systems with NUMA behavior. These computers are composed of several chipsets, each including one or several cores associated with a memory bank. The chipsets are linked together by a cache-coherent interconnection system. Such an architecture implies hierarchical memory access times from a given core to the different memory banks, and possibly different bandwidths depending on the respective locations of a given core and of the data it accesses [2]. It is thus important on such platforms to take these processor/memory locality effects into account when allocating resources. Modern operating systems commonly provide APIs dedicated to NUMA architectures that allow programmers to control where threads are executed and where memory is allocated. We used these interfaces to exhibit NUMA effects on different architectures. First, we studied the cost of thread and memory placement combinations on a set of BLAS functions. Tables 1 and 2 show the NUMA factor (the ratio between remote and local memory access performance) on one node of the NUMA8 architecture. The results confirm the presence of a shared memory bank for each chip of two cores. We also observe that the effects are more pronounced for computations that do not reuse data, such as level-1 BLAS routines, where data transfer is the bottleneck. The NUMA factor reaches 1.58 on this architecture.
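To illustrate how such a measurement can be taken, the sketch below pins the calling thread to a core with the Linux affinity API and binds memory to a chosen NUMA node with libnuma, then times a level-1 BLAS-like kernel (daxpy) for a local and a remote placement. This is a minimal illustration under assumed topology (core 0 on node 0, node 1 remote), not the benchmark used in the paper; the array size and node indices must be adapted to the machine.

```c
/* Minimal NUMA-factor sketch (not the paper's benchmark).
 * Build with: gcc numa_factor.c -lnuma -o numa_factor
 */
#define _GNU_SOURCE
#include <numa.h>        /* numa_available, numa_alloc_onnode, numa_free */
#include <sched.h>       /* cpu_set_t, sched_setaffinity */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 25)      /* 32M doubles per array: large enough to defeat caches */

static double elapsed(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) + 1e-9 * (b.tv_nsec - a.tv_nsec);
}

/* Time y <- a*x + y with the thread pinned to 'core' and data on NUMA 'node'. */
static double daxpy_on(int core, int node)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        exit(EXIT_FAILURE);
    }

    double *x = numa_alloc_onnode(N * sizeof(double), node);
    double *y = numa_alloc_onnode(N * sizeof(double), node);
    if (!x || !y) { fprintf(stderr, "allocation failed\n"); exit(EXIT_FAILURE); }

    /* Initialize; the allocation policy places the pages on 'node'. */
    for (size_t i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < N; i++)
        y[i] += 3.0 * x[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    numa_free(x, N * sizeof(double));
    numa_free(y, N * sizeof(double));
    return elapsed(t0, t1);
}

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return EXIT_FAILURE;
    }
    /* Assumption: core 0 belongs to node 0 and node 1 is remote;
     * check the real topology with numactl --hardware. */
    double local  = daxpy_on(0, 0);
    double remote = daxpy_on(0, 1);
    printf("local: %.3fs  remote: %.3fs  NUMA factor: %.2f\n",
           local, remote, remote / local);
    return EXIT_SUCCESS;
}
```

Because daxpy streams both arrays with no reuse, its runtime is dominated by memory transfers, which is exactly why level-1 BLAS routines expose the largest NUMA factors in the measurements above.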

[1] Rajeev Thakur et al. Test Suite for Evaluating Performance of MPI Implementations That Support MPI_THREAD_MULTIPLE. PVM/MPI, 2007.

[2] Pascal Hénon et al. On Using an Hybrid MPI-Thread Programming for the Implementation of a Parallel Sparse Direct Solver on a Network of SMP Nodes. PPAM, 2005.

[3] Pascal Hénon et al. PaStiX: A High-Performance Parallel Direct Solver for Sparse Symmetric Definite Systems. 2000.

[4] James Demmel et al. SuperLU_DIST: A Scalable Distributed-Memory Sparse Direct Solver for Unsymmetric Linear Systems. ACM TOMS, 2003.

[5] Pascal Hénon et al. PaStiX: A High-Performance Parallel Direct Solver for Sparse Symmetric Positive Definite Systems. Parallel Computing, 2002.

[6] Nathalie Furmento et al. NewMadeleine: A Fast Communication Scheduling Engine for High Performance Networks. 2007.

[7] Pascal Hénon et al. PaStiX: A Parallel Sparse Direct Solver Based on a Static Scheduling for Mixed 1D/2D Block Distributions. IPDPS Workshops, 2000.

[8] J. Roman et al. On Finding Approximate Supernodes for an Efficient ILU(k) Factorization. 2006.

[9] Michael Lang et al. Experiences in Scaling Scientific Applications on Current-Generation Quad-Core Processors. IEEE International Symposium on Parallel and Distributed Processing (IPDPS), 2008.

[10] Samuel Thibault et al. Building Portable Thread Schedulers for Hierarchical Multiprocessors: The BubbleSched Framework. Euro-Par, 2007.

[11] Joseph Antony et al. Exploring Thread and Memory Placement on NUMA Architectures: Solaris and Linux, UltraSPARC/FirePlane and Opteron/HyperTransport. HiPC, 2006.

[12] Anshul Gupta et al. Recent Progress in General Sparse Direct Solvers. International Conference on Computational Science, 2001.

[13] Patrick Amestoy et al. A Fully Asynchronous Multifrontal Solver Using Distributed Dynamic Scheduling. SIAM J. Matrix Anal. Appl., 2001.

[14] Francisco F. Rivera et al. Scheduling of Algorithms Based on Elimination Trees on NUMA Systems. Euro-Par, 1999.

[15] Pascal Hénon et al. On Finding Approximate Supernodes for an Efficient Block-ILU(k) Factorization. Parallel Computing, 2008.