Implementation of a Parallel Sparse Direct Solver on Vector Architecture

Large sparse linear systems arise in finite element analysis of elasticity and fluid problems. Thanks to advances in graph partitioning software, dense sub-matrices can now be extracted efficiently while minimizing fill-in during factorization. By analyzing the task dependencies in the block factorization of each dense matrix, the cores of multi-core CPUs that share main memory can be used in parallel and asynchronously. The tasks on dense sub-matrices consist of BLAS level 3 kernels, which make efficient use of the arithmetic capabilities of modern super-scalar CPUs with large cache memory. The same kernels also run efficiently on modern vector CPUs, without any explicit vectorization directives in the code. Nevertheless, a sparse part remains in the factorization process. Although it accounts for only a small fraction of the whole process and is almost negligible on super-scalar CPUs, its optimization is important on vector architectures because its vector loops are short.
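
To illustrate the kind of task dependency analysis mentioned above, the following is a minimal sketch and not the solver's actual implementation. It assumes a Cholesky-type block factorization without pivoting on an nb x nb grid of blocks. The task names (POTRF, TRSM, UPDATE) and the helper functions block_cholesky_tasks and schedule_waves are hypothetical and chosen only for this example. The sketch builds the dependency graph and groups the tasks into waves whose members do not depend on each other.

```python
from collections import defaultdict


def block_cholesky_tasks(nb):
    """Return {task: set of prerequisite tasks} for an nb x nb block grid."""
    deps = {}
    for k in range(nb):
        # Factor diagonal block (k, k) once all earlier updates to it are done.
        deps[("POTRF", k)] = {("UPDATE", k, k, m) for m in range(k)}
        for i in range(k + 1, nb):
            # Triangular solve for block (i, k), a BLAS level 3 TRSM-type task.
            deps[("TRSM", i, k)] = {("POTRF", k)} | {
                ("UPDATE", i, k, m) for m in range(k)
            }
        for i in range(k + 1, nb):
            for j in range(k + 1, i + 1):
                # Trailing update of block (i, j): SYRK-type if i == j,
                # GEMM-type otherwise.
                deps[("UPDATE", i, j, k)] = {("TRSM", i, k), ("TRSM", j, k)}
    return deps


def schedule_waves(deps):
    """Group tasks into waves; tasks within one wave are mutually independent."""
    level = {}

    def depth(task):
        # Depth = length of the longest chain of prerequisites before the task.
        if task not in level:
            level[task] = 1 + max((depth(d) for d in deps[task]), default=-1)
        return level[task]

    waves = defaultdict(list)
    for task in deps:
        waves[depth(task)].append(task)
    return [sorted(waves[lv]) for lv in sorted(waves)]


# Print the waves for a 3 x 3 block grid.
for number, wave in enumerate(schedule_waves(block_cholesky_tasks(3))):
    print(number, wave)
```

The wave grouping is a simplification: an asynchronous runtime would start each task as soon as its own prerequisites finish rather than waiting for a whole wave. The sketch also shows that the number of independent tasks shrinks toward the end of the factorization, which limits the available parallelism.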