论文信息 - Implementing a parallel sparse matrix-vector multiplication using dataflow

Implementing a parallel sparse matrix-vector multiplication using dataflow

Sparse matrix-vector multiplication (SpMV) is an important operation in sparse linear algebra problems. According to Bell [1], “they represent the dominant cost in many iterative methods for solving large-scale linear systems”. In this paper we propose a dataflow parallel version of the SpMV kernel using Compressed Sparse Row (CSR) storage format. This implementation is executed in a Dataflow Runtime Environment called Trebuchet [3][4][5]. The dataflow model[2] exposes parallelism by essence. Task execution is triggered as soon as the input data is available. A dataflow program can be described as a dataflow graph where nodes represent instructions (or tasks) and edges represent their input/output dependencies. Two instructions can run concurrently if there is no directed path between them and their pieces of data are available. This differs from Von Neumann model where instructions are guided by control flow and their execution is inherently sequential. Our experimental environment consists of a machine with 4 Quad-Core Intel Core i7-3820 processors running at 3.6GHz and 64 GB RAM. We performed the SpMV kernel in three matrices from University of Florida Sparse Matrix Collection1, listed in Table 1 . Figure 1 presents the results for Dataflow(running on Trebuchet), Intel Math Kernel Library(MKL)2 (which uses OpenMP3 in parallel version) implementations. Results show that with a naive dataflow implementation it is already possible to achieve higher speedups than the ones obtained with Intel MKL. For smalled matrices the dataflow runtime environment overheads become more evident, but for larger ones, Trebuchet provides better results. In order to improve performance, we intend to implement other SpMV kernels using different storage formats and other approaches to get better use of cache locality[6]. Moreover, since dataflow allows more flexibility in describing application dependencies, we may find more interesting parallelization patterns that are easier to implement and provide smaller overheads in dataflow.

[1] Vítor Santos Costa,et al. TALM: A Hybrid Execution Model with Distributed Speculation Support , 2010, 2010 22nd International Symposium on Computer Architecture and High Performance Computing Workshops.

[2] Vítor Santos Costa,et al. Trebuchet: exploring TLP with dataflow virtualisation , 2011, Int. J. High Perform. Syst. Archit..

[3] Jack B. Dennis,et al. A preliminary architecture for a basic data-flow processor , 1974, ISCA '75.

[4] Samuel Williams,et al. Reduced-Bandwidth Multithreaded Algorithms for Sparse Matrix-Vector Multiplication , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.

[5] Jack B. Dennis,et al. A preliminary architecture for a basic data-flow processor , 1974, ISCA '98.

[6] Michael Garland,et al. Eﬃcient Sparse Matrix-Vector Multiplication on CUDA , 2008 .