On the Comparative Performance of Parallel Algorithms on Small GPU / CUDA Clusters

GPUs programmed with CUDA are rapidly becoming a major choice in high-performance computing, and a growing number of applications are being ported to the CUDA platform. However, much less research has evaluated CUDA's performance when it is integrated with other parallel programming paradigms. We have developed a general-purpose matrix multiplication algorithm and a Conjugate Gradient algorithm using CUDA and MPI. In this approach, MPI serves as the data-distribution mechanism between the GPU nodes, while CUDA acts as the main computing engine. This enables the programmer to connect GPU nodes over high-speed Ethernet without specialized interconnect technologies, and to treat the GPU nodes as distinct units, executing different components of a program on different nodes.
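The division of labour described above, MPI distributing data between nodes and CUDA performing the local computation, can be sketched for the matrix multiplication case as follows. This is a minimal illustration, not the paper's implementation: the kernel, matrix size `N`, and the row-block decomposition are assumptions (with `N` taken to be divisible by the number of MPI ranks).

```cuda
// Hypothetical sketch: MPI scatters row blocks of A and broadcasts B;
// each rank multiplies its block on its local GPU; results are gathered.
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

#define N 512  // assumed matrix dimension, divisible by the number of ranks

__global__ void matmul(const float *A, const float *B, float *C,
                       int rows, int n) {
    int r = blockIdx.y * blockDim.y + threadIdx.y;  // row in local block
    int c = blockIdx.x * blockDim.x + threadIdx.x;  // column
    if (r < rows && c < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; ++k)
            sum += A[r * n + k] * B[k * n + c];
        C[r * n + c] = sum;
    }
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int rows = N / size;  // rows of A handled by each GPU node
    float *A = NULL, *C = NULL;
    float *B      = (float *)malloc(N * N * sizeof(float));
    float *Ablock = (float *)malloc(rows * N * sizeof(float));
    float *Cblock = (float *)malloc(rows * N * sizeof(float));
    if (rank == 0) {
        A = (float *)malloc(N * N * sizeof(float));
        C = (float *)malloc(N * N * sizeof(float));
        for (int i = 0; i < N * N; ++i) { A[i] = 1.0f; B[i] = 1.0f; }
    }

    // MPI as the data-distribution mechanism between GPU nodes
    MPI_Scatter(A, rows * N, MPI_FLOAT, Ablock, rows * N, MPI_FLOAT,
                0, MPI_COMM_WORLD);
    MPI_Bcast(B, N * N, MPI_FLOAT, 0, MPI_COMM_WORLD);

    // CUDA as the computing engine on each node
    float *dA, *dB, *dC;
    cudaMalloc(&dA, rows * N * sizeof(float));
    cudaMalloc(&dB, N * N * sizeof(float));
    cudaMalloc(&dC, rows * N * sizeof(float));
    cudaMemcpy(dA, Ablock, rows * N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, N * N * sizeof(float), cudaMemcpyHostToDevice);

    dim3 threads(16, 16);
    dim3 blocks((N + 15) / 16, (rows + 15) / 16);
    matmul<<<blocks, threads>>>(dA, dB, dC, rows, N);
    cudaMemcpy(Cblock, dC, rows * N * sizeof(float), cudaMemcpyDeviceToHost);

    // Collect the partial results back on the root node
    MPI_Gather(Cblock, rows * N, MPI_FLOAT, C, rows * N, MPI_FLOAT,
               0, MPI_COMM_WORLD);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    MPI_Finalize();
    return 0;
}
```

Because each rank only ever touches its own row block, the nodes communicate solely through the MPI collectives, which is what allows them to be linked by ordinary high-speed Ethernet rather than a specialized interconnect.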