Parallelized QR decomposition using GPUs

This paper presents performance results for a parallelized, accelerated eigendecomposition that uses the block Householder QR decomposition algorithm on a graphics processing unit (GPU). The QR software was developed with NVIDIA’s CUDA parallel computing platform and programming model and executed on an NVIDIA Tesla GPU accelerator card. We highlight the factors that affect the performance of the GPU-accelerated QR implementation relative to a baseline serial implementation developed in MATLAB and executed on a conventional multi-core processor. We also compare our results with relevant previously published studies and discuss possible performance bottlenecks and achievable speedups.