Detector defect correction of medical images on graphics processors

The ever-increasing complexity and power dissipation of computer architectures over the last decade have paved the way for more power-efficient parallel architectures. Architectures such as field-programmable gate arrays (FPGAs) and, in particular, graphics cards have therefore attracted great interest and have been adopted for the parallel execution of many number-crunching loop programs from fields like image processing and linear algebra. However, little effort has been made to deploy applications to graphics hardware that involve little computation but are memory intensive. This paper considers a memory-intensive detector defect correction pipeline for medical imaging with strict latency requirements. The image pipeline compensates for different effects caused by the detector during exposure of X-ray images and calculates parameters to control the subsequent dosage. So far, dedicated hardware setups with special processors such as DSPs have been used for such critical processing. We show that this is now feasible with commodity graphics hardware. Using CUDA as the programming model, we demonstrate that the detector defect correction pipeline, which consists of more than ten algorithms, is significantly accelerated and that a speedup of 20x can be achieved on NVIDIA's Quadro FX 5800 compared to our reference implementation. For deployment in a streaming application with continuously arriving data, we show that double buffering reduces the memory transfer overhead of successive images to the graphics card memory by 83%.
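The double buffering mentioned above can be illustrated with a minimal CUDA sketch. It is not the implementation evaluated in this paper: the kernel correct_frame, the function run_double_buffered, the offset and gain constants, and the synthetic input data are all hypothetical placeholders, and a single offset/gain stage stands in for the full pipeline. The sketch only shows the general pattern, in which two device buffers and two streams let the upload of one frame overlap with the processing of the previous one. How much overlap is actually obtained depends on the device and driver.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s (line %d)\n",             \
                    cudaGetErrorString(err_), __LINE__);              \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical stand-in for one correction stage: offset subtraction and gain scaling.
__global__ void correct_frame(const unsigned short* in, float* out,
                              float offset, float gain, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = (static_cast<float>(in[i]) - offset) * gain;
}

// Streams num_frames frames of n pixels each through two device buffers.
// Returns the number of pixels that differ from a host-side reference.
int run_double_buffered(int num_frames, int n) {
    const float offset = 100.0f, gain = 0.5f;   // assumed calibration values
    const size_t total = static_cast<size_t>(num_frames) * n;

    // Page-locked host memory is required for truly asynchronous copies.
    unsigned short* h_in = nullptr;
    float* h_out = nullptr;
    CUDA_CHECK(cudaMallocHost((void**)&h_in, total * sizeof(unsigned short)));
    CUDA_CHECK(cudaMallocHost((void**)&h_out, total * sizeof(float)));
    for (size_t i = 0; i < total; ++i) h_in[i] = static_cast<unsigned short>(i % 4096);

    // Two device buffers and two streams: frame f uses slot f % 2.
    unsigned short* d_in[2];
    float* d_out[2];
    cudaStream_t stream[2];
    for (int b = 0; b < 2; ++b) {
        CUDA_CHECK(cudaMalloc((void**)&d_in[b], n * sizeof(unsigned short)));
        CUDA_CHECK(cudaMalloc((void**)&d_out[b], n * sizeof(float)));
        CUDA_CHECK(cudaStreamCreate(&stream[b]));
    }

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    for (int f = 0; f < num_frames; ++f) {
        const int b = f % 2;
        const size_t off = static_cast<size_t>(f) * n;
        // Work issued to stream[b] may overlap with work still running in the other stream.
        // Reuse of slot b by frame f + 2 is ordered after frame f by the stream itself.
        CUDA_CHECK(cudaMemcpyAsync(d_in[b], h_in + off, n * sizeof(unsigned short),
                                   cudaMemcpyHostToDevice, stream[b]));
        correct_frame<<<blocks, threads, 0, stream[b]>>>(d_in[b], d_out[b], offset, gain, n);
        CUDA_CHECK(cudaGetLastError());
        CUDA_CHECK(cudaMemcpyAsync(h_out + off, d_out[b], n * sizeof(float),
                                   cudaMemcpyDeviceToHost, stream[b]));
    }
    CUDA_CHECK(cudaDeviceSynchronize());

    // Compare against the same arithmetic done on the host.
    int mismatches = 0;
    for (size_t i = 0; i < total; ++i) {
        const float expected = (static_cast<float>(h_in[i]) - offset) * gain;
        if (h_out[i] != expected) ++mismatches;
    }

    for (int b = 0; b < 2; ++b) {
        CUDA_CHECK(cudaStreamDestroy(stream[b]));
        CUDA_CHECK(cudaFree(d_in[b]));
        CUDA_CHECK(cudaFree(d_out[b]));
    }
    CUDA_CHECK(cudaFreeHost(h_in));
    CUDA_CHECK(cudaFreeHost(h_out));
    return mismatches;
}
```

Calling run_double_buffered(8, 1 << 16) from a main function processes eight synthetic frames and returns 0 if every pixel matches the host reference. The sketch makes no attempt to reproduce the 83% reduction reported here. It only shows why alternating buffers removes the need to wait for one frame's transfer before starting work on the next.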