An Enhanced Image Reconstruction Tool for Computed Tomography on GPUs

The algebraic reconstruction technique (ART) is an iterative algorithm for computed tomography (CT) image reconstruction that delivers better image quality with a lower radiation dose than the industry-standard filtered back projection (FBP). However, the high computational cost of ART requires researchers to turn to high-performance computing to accelerate the algorithm, and existing GPU approaches to ART suffer from inefficient designs of both the compressed data structures and the computational kernels. This paper therefore presents cuART, our enhanced CUDA-based CT image reconstruction tool built on ART. It delivers a compression and parallelization solution for ART-based image reconstruction on GPUs. We address the under-performance of popular GPU libraries and formats, e.g., cuSPARSE, BRC, and CSR5, on the ART algorithm and propose a symmetry-based CSR format (SCSR) that further compresses the CSR data structure and optimizes data access for both SpMV and SpMV_T via a column-indices permutation. We also propose sorting-based and sorting-free blocking techniques that optimize the kernel computation by leveraging the sparsity patterns of the system matrix. As a result, cuART significantly reduces the memory footprint and enables practical CT datasets to fit into a single GPU. Experimental results on an NVIDIA Tesla K80 GPU show that our approach achieves up to 6.8x, 7.2x, and 5.4x speedups over counterparts that use cuSPARSE, BRC, and CSR5, respectively.
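To make the terms in the abstract concrete, the following is a minimal CPU sketch in plain Python of the standard CSR layout, the two products SpMV (y = Ax) and SpMV_T (x = A^T y), and one sweep of the classic row-action (Kaczmarz) form of ART. It is an illustration only and makes several assumptions: it uses plain CSR rather than the proposed SCSR format, it applies no column-indices permutation or blocking, and the function names (`spmv`, `spmv_t`, `art_sweep`) and the `relax` parameter are hypothetical names chosen for this example, not part of cuART.

```python
# Illustrative sketch only: plain CSR on the CPU, not the SCSR format or
# the GPU kernels described in the abstract. All names are hypothetical.

def spmv(row_ptr, col_idx, vals, x):
    """Compute y = A @ x for a matrix A stored in CSR form."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        # Nonzeros of row i occupy positions row_ptr[i] .. row_ptr[i+1]-1.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]
    return y


def spmv_t(row_ptr, col_idx, vals, y, n_cols):
    """Compute x = A^T @ y without building the transpose explicitly."""
    x = [0.0] * n_cols
    for i in range(len(row_ptr) - 1):
        # Each nonzero A[i, j] scatters its contribution into x[j].
        for k in range(row_ptr[i], row_ptr[i + 1]):
            x[col_idx[k]] += vals[k] * y[i]
    return x


def art_sweep(row_ptr, col_idx, vals, b, x, relax=1.0):
    """One Kaczmarz sweep: project x onto each row's hyperplane in turn."""
    for i in range(len(row_ptr) - 1):
        lo, hi = row_ptr[i], row_ptr[i + 1]
        norm_sq = sum(vals[k] * vals[k] for k in range(lo, hi))
        if norm_sq == 0.0:
            continue  # an empty row carries no constraint
        residual = b[i] - sum(vals[k] * x[col_idx[k]] for k in range(lo, hi))
        step = relax * residual / norm_sq
        for k in range(lo, hi):
            x[col_idx[k]] += step * vals[k]
    return x


# Tiny 2x3 example: A = [[1, 0, 2],
#                        [0, 3, 0]]
row_ptr = [0, 2, 3]
col_idx = [0, 2, 1]
vals = [1.0, 2.0, 3.0]

projections = spmv(row_ptr, col_idx, vals, [1.0, 1.0, 1.0])
backprojection = spmv_t(row_ptr, col_idx, vals, [1.0, 1.0], n_cols=3)
estimate = art_sweep(row_ptr, col_idx, vals, projections, [0.0, 0.0, 0.0])
```

Note the access patterns: SpMV gathers from `x` through `col_idx`, whereas SpMV_T scatters into `x` through the same indices, which on a GPU typically means irregular writes or atomic updates. Handling both directions efficiently from one stored structure is the data-access problem the abstract refers to.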
