Conjugate-Gradients Algorithms: An MPI-OpenMP Implementation on Distributed Shared Memory Systems

We present a parallel implementation, on distributed shared memory architectures, of two Conjugate-Gradients algorithms, the CGS and the BiCGSTAB methods, associated with some algebraic preconditioners. We analyze the programming environment supplied by an SMP cluster of Digital AlphaServer 4100. We take into account two different implementations of the algorithms: one based on MPI and the other one based on a mix of MPI calls, for the message passing among different nodes, and OpenMP directives to parallelize the basic linear algebra operations within each node. This multilevel parallelism can also be obtained with a combination of MPI and the Digital Extended Math Library Pthreads (DXMLP), a multi-threaded version of the BLAS and LAPACK routines available on the considered architecture.

1 Parallel Implementation

In this work we address a parallel implementation, using the SPMD programming paradigm, of the CGS and BiCGSTAB methods [1] and of some algebraic preconditioners based on diagonal scalings and incomplete factorizations. The SPMD programming model implies focusing on the data associated with the problem, determining an appropriate partition and working out how to associate computation with data. Several approaches are plausible, but we restricted our attention to partitioning the linear system by rows, choosing a block-cyclic or cyclic(k) data distribution [2].

The basic computational kernels of the considered iterative schemes are vector updates, inner products and matrix-vector products. Vector updates are trivially parallelizable: each processor updates its own segment of each vector. The inner products can be easily parallelized: each processor computes a local inner product (LIP) using the segments of each vector stored in its local memory; the LIPs then travel across processors and are summed together to obtain the required inner product. The parallelization of matrix-vector products depends on the structure of the matrix, because each processor holds only a segment of the vector in its memory. Thus, communication may be necessary to obtain other elements of the vector, resulting in global or local message passing operations.

The software, which was first implemented on a Cray T3D [3] and further tested on the HP-Convex Exemplar systems at CILEA [4], has been ported to the cluster of Digital AlphaServer machines at CASPUR in Rome. This cluster consists of four AlphaServer 4100 5/400 nodes [5] interconnected with Memory Channel. Each node is a symmetric multiprocessing (SMP) machine with four CPUs and 8 Gbytes of memory. Each CPU is an Alpha 21164 (400 MHz) microprocessor, endowed with a primary 8-Kbyte instruction cache and two levels of data caches (8 Kbytes and 96 Kbytes, respectively). Memory Channel technology [6] is a high-performance network that implements a form of shared virtual memory. The current implementation consists of a 100-Mbyte-per-second bus that provides a write-only path from a page of virtual address space of one node to a page of physical memory on another node. A Memory Channel cluster consists of one PCI adapter on each node and a hub connecting the nodes.

For portability reasons we implemented the code employing MPI to manage interprocessor communication and relying on the standard BLAS library to perform the linear algebra operations. This choice allowed us to exploit a combination of explicit message passing and shared memory parallelism using the Digital Extended Math Library Pthreads (DXMLP) [7]. This library, which includes the BLAS and some LAPACK routines, is a set of mathematical subprograms optimized for Digital architectures, running in parallel on a single SMP node using Pthreads.
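As a concrete illustration of these kernels in the pure MPI version, the following C fragment sketches the inner product and the vector update over a row-partitioned system. It is only a minimal sketch: the routine names, the hand-written loops and the assumption of a simple contiguous block per process are illustrative; the actual code relies on BLAS calls for the local operations and on block-cyclic or cyclic(k) distributions.

    #include <mpi.h>

    /* Inner product of two distributed vectors: each process computes a local
     * inner product (LIP) on the segment it owns, then the LIPs are summed
     * across all processes to obtain the required inner product. */
    double distributed_dot(const double *x_loc, const double *y_loc, int n_loc)
    {
        double lip = 0.0, dot = 0.0;

        for (int i = 0; i < n_loc; i++)      /* local inner product (LIP) */
            lip += x_loc[i] * y_loc[i];

        /* sum the LIPs; every process receives the global result */
        MPI_Allreduce(&lip, &dot, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        return dot;
    }

    /* Vector update: trivially parallel, each process updates its own
     * segment without any communication. */
    void local_axpy(double alpha, const double *x_loc, double *y_loc, int n_loc)
    {
        for (int i = 0; i < n_loc; i++)
            y_loc[i] += alpha * x_loc[i];
    }

The matrix-vector product follows the same ownership scheme, but may first require message passing operations to gather the vector entries held by other processes.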
Table 1: Scaled speedup, fixed speedup, number of iterations, total execution time and time per iteration (sec) for a fixed size problem (n = 3008), varying the number p of processors.

     p   scaled    fixed     total    time per    number of
         speedup   speedup   time     iteration   iterations
     1    1.00      1.00     79.22      0.56         142
     2    1.99      1.67     47.32      0.33         142
     3    2.98      2.16     36.76      0.27         133
     4    3.88      2.62     30.26      0.21         147
     5    4.71      3.40     23.32      0.17         139
     6    5.79      3.95     20.05      0.14         140
     7    6.28      3.85     20.59      0.14         146
     8    7.36      4.95     15.99      0.12         134
    10    8.41      6.29     12.60      0.09         135
    12    9.56      7.90     10.03      0.07         137
    14    9.86     10.52      7.53      0.05         140
    16    9.38     13.09      6.05      0.04         144

Such multilevel parallelism can also be achieved by means of Guide from the KAP/Pro Toolset [8], which exploits the parallelism available on SMP machines. Through the use of directives, Guide translates a program running on one processor into one that runs simultaneously on multiple processors. It uses the standard OpenMP directives on all supported UNIX and NT systems, so that portability is guaranteed. For codes that are already parallel, Guide also translates common Cray and SGI directives to OpenMP directives. The Guide features include Control Parallelism (parallel region, parallel do, parallel section) and Storage Parallelism (shared and private data specifications); a sketch of this kind of directive-based, loop-level parallelism combined with MPI is given at the end of this section.

2 Experimental results

We analyze the performance obtained on a test problem using the finite element Wathen matrix [9]. It represents the n × n consistency mass matrix for a regular nx-by-ny grid of 8-node elements in two space dimensions, where n = 3 nx ny + 2 nx + 2 ny + 1. For the sake of conciseness, we discuss only the results of the unpreconditioned BiCGSTAB algorithm.

Table 1 shows the results obtained for the MPI version of the code on a fixed size problem, varying the number of processors p. A satisfactory scalability is achieved for p ≤ 8; when all the available nodes of the cluster are involved in the computation (p ≥ 10), the results are not as good, due to some conflicts in accessing the Memory Channel.

Tables 2–5 display the performance of a small number of multi-threaded MPI processes. We report the times per iteration (sec) of the DXMLP and Guide versions of the code.

Table 2: Comparison between the DXMLP and Guide versions of the code, increasing the number of threads, for a number of MPI processes nproc = 1.
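To make the multi-threaded variants more concrete, the sketch below (referred to at the end of the Guide paragraph above) shows, in C with OpenMP pragmas, the kind of two-level scheme underlying both the DXMLP and the Guide versions: loop-level directives spread the local operations over the CPUs of an SMP node, while MPI reduces the partial results across nodes. The explicit loops and routine names are illustrative assumptions; in the actual code the node-level parallelism comes from the multi-threaded DXMLP BLAS or from Guide-processed directives rather than from hand-written loops.

    #include <mpi.h>
    #include <omp.h>

    /* Hybrid inner product: OpenMP threads compute the local inner product
     * within one SMP node, MPI sums the per-process results across nodes. */
    double hybrid_dot(const double *x_loc, const double *y_loc, int n_loc)
    {
        double lip = 0.0, dot = 0.0;

        #pragma omp parallel for reduction(+:lip)   /* intra-node parallelism */
        for (int i = 0; i < n_loc; i++)
            lip += x_loc[i] * y_loc[i];

        /* inter-node reduction of the local inner products */
        MPI_Allreduce(&lip, &dot, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        return dot;
    }

    /* Vector update parallelized with a "parallel do"-style directive. */
    void hybrid_axpy(double alpha, const double *x_loc, double *y_loc, int n_loc)
    {
        #pragma omp parallel for
        for (int i = 0; i < n_loc; i++)
            y_loc[i] += alpha * x_loc[i];
    }

On the cluster described in Section 1, running, for instance, one such MPI process per AlphaServer node with four threads each would engage all sixteen CPUs.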