Optimizing a Parallel Conjugate Gradient Solver

We develop a highly optimized parallel conjugate gradient solver, examining both its single-node performance and its parallel efficiency. On a 32-node Hitachi SR4300, the solver handles a problem with 278,000 degrees of freedom at 1.1 GFLOPS. We also examine how the quality of the mesh partitioning affects the performance of the algorithm.
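
For orientation, the basic conjugate gradient iteration can be sketched as follows. This is a minimal serial illustration in plain Python, assuming a small dense symmetric positive definite matrix stored as a list of rows; it is not the optimized solver described here, and the names `matvec`, `dot`, and `conjugate_gradient` are hypothetical helpers chosen for this example. In a distributed version, the matrix-vector product and the dot products are typically the steps that require communication between nodes.

```python
def matvec(A, x):
    """Dense matrix-vector product; A is a list of rows (illustrative only)."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]


def dot(u, v):
    """Inner product of two vectors; a global reduction in a parallel code."""
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A, starting from x = 0."""
    x = [0.0] * len(b)
    r = list(b)            # residual b - A x, which equals b when x = 0
    p = list(r)            # initial search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:   # stop once the residual norm is small
            break
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)                      # step length
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        beta = rs_new / rs_old                           # direction update
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


if __name__ == "__main__":
    # Small assumed test system, not data from the paper.
    A = [[4.0, 1.0], [1.0, 3.0]]
    b = [1.0, 2.0]
    print(conjugate_gradient(A, b))
```

Each iteration costs one matrix-vector product and two dot products, so single-node performance depends largely on how efficiently the matrix-vector product is carried out, while parallel efficiency depends on how much data must be exchanged across partition boundaries.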