cuGimli: optimized implementation of the Gimli authenticated encryption and hash function on GPU for IoT applications

The U.S. National Institute of Standards and Technology (NIST) recently initiated a global competition to standardize lightweight authenticated encryption with associated data (AEAD) and hash functions. Gimli, one of the Round 2 candidates, is designed for efficient implementation across various platforms, including hardware (VLSI and FPGA), microprocessors, and microcontrollers. However, the performance of Gimli on massively parallel architectures such as Graphics Processing Units (GPUs) is still unknown. A high-performance Gimli implementation on GPUs can be especially useful for Internet of Things (IoT) applications, in which gateway devices and cloud servers must handle a massive number of communications protected by AEAD. In this paper, we show that, with careful optimization, Gimli can be implemented efficiently on desktop and embedded GPUs to achieve extremely high throughput. Our experiments show that the proposed Gimli implementation achieves 661.44 KB/s (encryption), 892.24 KB/s (decryption), and 4344.46 KB/s (hashing) on state-of-the-art GPUs.
