Fast AES Implementation: A High-Throughput Bitsliced Approach

In this work, a high-throughput bitsliced AES implementation is proposed. It builds upon a new data representation scheme that exploits the parallelization capability of modern multi/many-core platforms. This representation scheme is used as a building block to redesign every AES stage for multi/many-core implementation. With the proposed bitsliced approach, each parallelization unit processes thirty-two 128-bit input blocks at once, so a high degree of parallelization is achieved. Given the characteristics of this implementation model, the ShiftRows stage can be handled implicitly through input rearrangement and is simplified to the point where its computation cost is negligible. Costly byte-wise operations are performed through register shifts and swaps. In addition, the need for look-up-table-based I/O operations in the Substitute Bytes stage is eliminated by using an S-box logic circuit, which is optimized to process 32 chunks of 128-bit input data simultaneously. We develop high-throughput CTR and ECB AES encryption/decryption on six CUDA-enabled GPUs, achieving 1.47 Tbps and 1.38 Tbps of encryption throughput, respectively, on a Tesla V100 GPU.
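
The bitsliced representation described above can be illustrated with a minimal Python sketch. It is not the authors' CUDA code, and the abstract does not specify the exact data layout, so the layout here is an assumption: slice i is a 32-bit word holding bit i of each of the 32 blocks, with block j in lane (bit position) j. The function names to_bitsliced, from_bitsliced, add_round_key, and shift_rows are hypothetical. Under this assumed layout, a round-key XOR touches all 32 blocks with one word operation per slice, and ShiftRows reduces to renaming slice indices, which is one way to read the claim that ShiftRows can be absorbed into input rearrangement.

```python
MASK32 = 0xFFFFFFFF  # one lane per block: 32 blocks fit in one 32-bit word


def to_bitsliced(blocks):
    """Transpose 32 blocks of 16 bytes into 128 words of 32 bits.

    Assumed layout: slices[i] holds bit i of every block,
    and lane j (bit j of the word) belongs to block j.
    """
    assert len(blocks) == 32 and all(len(b) == 16 for b in blocks)
    slices = [0] * 128
    for j, block in enumerate(blocks):
        for i in range(128):
            bit = (block[i // 8] >> (i % 8)) & 1
            slices[i] |= bit << j
    return slices


def from_bitsliced(slices):
    """Inverse of to_bitsliced: rebuild the 32 original 16-byte blocks."""
    blocks = [bytearray(16) for _ in range(32)]
    for i, word in enumerate(slices):
        for j in range(32):
            if (word >> j) & 1:
                blocks[j][i // 8] |= 1 << (i % 8)
    return [bytes(b) for b in blocks]


def add_round_key(slices, round_key):
    """XOR one 16-byte round key into all 32 blocks at once.

    A set key bit flips every lane of the matching slice,
    so each slice needs a single word-wide XOR.
    """
    out = []
    for i, word in enumerate(slices):
        key_bit = (round_key[i // 8] >> (i % 8)) & 1
        out.append(word ^ (MASK32 if key_bit else 0))
    return out


def shift_rows(slices):
    """ShiftRows as a renaming of slice indices; no bit is recomputed.

    The AES state is column-major (byte index = 4 * col + row),
    and row r is rotated left by r columns.
    """
    out = [0] * 128
    for col in range(4):
        for row in range(4):
            src_byte = 4 * ((col + row) % 4) + row
            dst_byte = 4 * col + row
            out[8 * dst_byte:8 * dst_byte + 8] = slices[8 * src_byte:8 * src_byte + 8]
    return out
```

Because shift_rows only reorders the list, a real implementation could fold the permutation into the order in which slices are loaded or named, leaving no runtime work for that stage. The S-box and MixColumns stages are omitted here; in a bitsliced design they would be expressed as Boolean operations (AND, XOR, NOT) over these same 128 words.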
