Performance of the Finite Difference Method Using Cache and Shared Memory for Massively Parallel Systems

Many algorithms have been introduced to improve performance on massively parallel systems, which consist of several hundred processors. A typical example is a GPU system with many processors that uses shared memory. For image filtering algorithms, which reference neighboring points, shared memory improves performance because adjacent pixels are accessed frequently. However, using shared memory requires rewriting existing code and makes that code more complex. Recent GPU systems support both L1 and L2 caches along with shared memory. Because the L1 cache is located in the same on-chip area as the shared memory, a comparable performance improvement can be expected from the cache alone. In this paper, the performance of the cache-based and shared-memory approaches is compared. The cache-based algorithm performs very similarly to the shared-memory algorithm, while avoiding the code complexity that the shared-memory version introduces.
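To make the code-complexity contrast concrete, the following is a minimal sketch, not the paper's implementation. It assumes a 5-point Laplacian stencil on a 2D float image; the kernel names, the `TILE` size, and the helper `max_stencil_mismatch` are hypothetical. The first kernel reads neighbors directly from global memory and relies on the hardware cache; the second stages a tile plus a one-pixel halo in shared memory.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <cmath>
#include <vector>

#define TILE 16  // hypothetical block edge; blocks are TILE x TILE threads

// Cache-based version: neighbors are read from global memory and any
// reuse between adjacent threads is left to the L1/L2 caches.
__global__ void laplace_cached(const float* in, float* out, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || y < 1 || x >= w - 1 || y >= h - 1) return;  // skip border
    int i = y * w + x;
    out[i] = in[i - 1] + in[i + 1] + in[i - w] + in[i + w] - 4.0f * in[i];
}

// Shared-memory version: each block first copies its tile and a
// one-pixel halo into shared memory, then computes from that copy.
__global__ void laplace_shared(const float* in, float* out, int w, int h) {
    __shared__ float tile[TILE + 2][TILE + 2];
    int tx = threadIdx.x, ty = threadIdx.y;
    int x = blockIdx.x * TILE + tx, y = blockIdx.y * TILE + ty;
    int cx = min(x, w - 1), cy = min(y, h - 1);  // clamp so every load is in range
    tile[ty + 1][tx + 1] = in[cy * w + cx];
    if (tx == 0)        tile[ty + 1][0]        = in[cy * w + max(cx - 1, 0)];
    if (tx == TILE - 1) tile[ty + 1][TILE + 1] = in[cy * w + min(cx + 1, w - 1)];
    if (ty == 0)        tile[0][tx + 1]        = in[max(cy - 1, 0) * w + cx];
    if (ty == TILE - 1) tile[TILE + 1][tx + 1] = in[min(cy + 1, h - 1) * w + cx];
    __syncthreads();  // all loads must finish before any thread reads the tile
    if (x < 1 || y < 1 || x >= w - 1 || y >= h - 1) return;  // skip border
    out[y * w + x] = tile[ty + 1][tx] + tile[ty + 1][tx + 2]
                   + tile[ty][tx + 1] + tile[ty + 2][tx + 1]
                   - 4.0f * tile[ty + 1][tx + 1];
}

// Hypothetical helper: runs both kernels on the same input and returns the
// largest absolute difference between their outputs. Error checking on the
// CUDA calls is omitted for brevity.
float max_stencil_mismatch(int w, int h) {
    std::vector<float> host(w * h);
    for (int i = 0; i < w * h; ++i) host[i] = std::sin(0.01f * i);
    size_t bytes = host.size() * sizeof(float);
    float *in, *a, *b;
    cudaMalloc(&in, bytes); cudaMalloc(&a, bytes); cudaMalloc(&b, bytes);
    cudaMemcpy(in, host.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemset(a, 0, bytes); cudaMemset(b, 0, bytes);
    // Hint (honored only on GPUs with a configurable split): prefer L1 over shared memory.
    cudaFuncSetCacheConfig(laplace_cached, cudaFuncCachePreferL1);
    dim3 block(TILE, TILE), grid((w + TILE - 1) / TILE, (h + TILE - 1) / TILE);
    laplace_cached<<<grid, block>>>(in, a, w, h);
    laplace_shared<<<grid, block>>>(in, b, w, h);
    std::vector<float> ra(w * h), rb(w * h);
    cudaMemcpy(ra.data(), a, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(rb.data(), b, bytes, cudaMemcpyDeviceToHost);
    cudaFree(in); cudaFree(a); cudaFree(b);
    float worst = 0.0f;
    for (int i = 0; i < w * h; ++i) worst = std::max(worst, std::fabs(ra[i] - rb[i]));
    return worst;
}
```

Calling `max_stencil_mismatch(100, 70)` from a host `main` checks that the two kernels agree on an image whose size is not a multiple of the tile. The sketch only illustrates the difference in code structure (halo loads, clamping, synchronization); it makes no claim about relative speed, which depends on the GPU and its cache configuration.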
