An OpenACC Optimizer for Accelerating Histogram Computation on a GPU

This paper presents a source-to-source OpenACC optimizer that automatically optimizes histogram computation code for a graphics processing unit (GPU). Parallel histogram codes typically deploy multiple copies of the histogram and update them with atomic operations. This duplication method can be implemented in OpenACC. However, the structure of the sequential code blocks must be rewritten manually owing to limitations of OpenACC directives. Such rewritten code does not always achieve the highest performance on every platform, and thus the duplication method degrades the performance portability of the code. To tackle this issue, we propose an optimizer that identifies histogram-related blocks in a naive OpenACC code and automatically rewrites the detected blocks so that multiple copies of histograms can be exploited for acceleration. In experiments, we apply our optimizer to three practical applications and investigate their performance on three platforms: an NVIDIA GPU, an AMD GPU, and an Intel CPU. Experimental results show that our automated approach helps OpenACC codes maximize the performance of histogram computation, thereby enhancing the performance portability of the code.
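To make the duplication method concrete, the following is a minimal sketch in C with OpenACC directives. It is not the output of the optimizer described above. The function names, the bin count `NUM_BINS`, the `copies` parameter, and the rule that iteration `i` updates copy `i % copies` are all assumptions made for illustration, and the copies are kept in ordinary device memory for simplicity. The first function is the naive form with a single shared histogram; the second spreads the atomic updates over several copies and then sums them.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NUM_BINS 256 /* assumed bin count (one bin per 8-bit value) */

/* Naive form: every iteration atomically updates one shared histogram. */
void histogram_naive(const unsigned char *data, long n, unsigned int *hist)
{
    memset(hist, 0, NUM_BINS * sizeof(unsigned int));
    #pragma acc parallel loop copyin(data[0:n]) copy(hist[0:NUM_BINS])
    for (long i = 0; i < n; i++) {
        #pragma acc atomic update
        hist[data[i]]++;
    }
}

/* Duplication form (hypothetical layout): `copies` sub-histograms are
 * stored back to back in `sub`. Iteration i updates copy i % copies,
 * which spreads atomic updates to the same bin over several addresses.
 * A second loop sums the copies into the final histogram.
 * Returns 0 on success and -1 if the allocation fails. */
int histogram_duplicated(const unsigned char *data, long n,
                         unsigned int *hist, int copies)
{
    assert(copies > 0);
    long total = (long)copies * NUM_BINS;
    unsigned int *sub = calloc((size_t)total, sizeof(unsigned int));
    if (sub == NULL)
        return -1;

    #pragma acc data copyin(data[0:n]) copy(sub[0:total]) copyout(hist[0:NUM_BINS])
    {
        #pragma acc parallel loop
        for (long i = 0; i < n; i++) {
            long c = i % copies; /* assumed copy-selection rule */
            #pragma acc atomic update
            sub[c * NUM_BINS + data[i]]++;
        }

        /* Merge step: each bin is summed across all copies. */
        #pragma acc parallel loop
        for (int b = 0; b < NUM_BINS; b++) {
            unsigned int sum = 0;
            for (int c = 0; c < copies; c++)
                sum += sub[(long)c * NUM_BINS + b];
            hist[b] = sum;
        }
    }

    free(sub);
    return 0;
}
```

Both functions should produce the same histogram for the same input. Without an OpenACC compiler the directives are ignored and the code runs sequentially, which is enough to check correctness. The best number of copies is likely platform dependent, which is the tuning burden the abstract attributes to manual rewriting.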
