OpenCL performance portability for general‐purpose computation on graphics processor units: an exploration on cryptographic primitives

The modern trend toward heterogeneous many‐core architectures has led to high architectural diversity in both high‐performance and high‐end embedded systems. To exploit the computational resources of such a wide range of architectures effectively, programming languages and APIs such as OpenCL have become increasingly popular. Although OpenCL provides functional code portability and the ability to fine‐tune an application to the target hardware, performance portability is still an open problem. Many research works have therefore investigated the optimization of specific combinations of application and target platform. In this paper, we leverage the experience gained in implementing algorithms from the cryptography domain to provide a set of guidelines for performance portability across modern many‐core heterogeneous architectures, and to establish a base on which domain‐specific languages and compiler transformations could be built in the near future. We study algorithmic choices and the effect of compiler transformations on three representative applications in the chosen domain on a set of seven target platforms. To estimate how well an application fits an architecture, we define a metric of computational intensity for both the architecture and the application implementation. Besides being useful for comparing different implementations or algorithmic choices and their fitness to a specific architecture, the metric can also help the compiler guide the code optimization process. Copyright © 2014 John Wiley & Sons, Ltd.
