Accelerating RSA with Fine-Grained Parallelism Using GPU

RSA is a public key cryptography widely used for end-to-end authentication and key exchange in various Internet protocols, such as SSL and TLS. Compared with symmetric cryptography, the cryptographic operations in RSA is much more time consuming. This brings pressure on performance to service providers using secure protocols, and hinders these protocols from being more widely used. Graphics Processing Units (GPUs) are increasingly used for intensive data parallelism general purpose computing. GPUs often provide better throughput than CPUs at the same cost. In this paper, we propose a new approach to parallelize Montgomery multiplication under the Single Instruction Multiple Thread (SIMT) threading model of GPUs, and construct a parallel RSA implementation based on this approach, combining with other optimization techniques both in the algorithmic level and implementation level. The performance evaluation shows our RSA implementation achieves a record-breaking latency for RSA decryption implementations on GPUs: 2.6 ms for RSA-1024 and 6.5 ms for RSA-2048. The peak throughtput of decryptions per second of our implementation reaches 5,244 for RSA-2048 and 34,981 for RSA-1024 respectively, which is much faster than existing integer-based implementations. The peak throughput of our implementation is slightly slower than the fastest floating-point based implementation, while the latency of our implementation is 3 times faster.

[1]  Elaine B. Barker,et al.  SP 800-131A. Transitions: Recommendation for Transitioning the Use of Cryptographic Algorithms and Key Lengths , 2011 .

[2]  J. Quisquater,et al.  Fast decipherment algorithm for RSA public-key cryptosystem , 1982 .

[3]  Ç. Koç,et al.  Incomplete reduction in modular arithmetic , 2002 .

[4]  Jiawei Zhu,et al.  Accelerating AES in JavaScript with WebGL , 2013, ICICS.

[5]  Burton S. Kaliski,et al.  A Cryptographic Library for the Motorola DSP56000 , 1991, EUROCRYPT.

[6]  Laxmi N. Bhuyan,et al.  Anatomy and Performance of SSL Processing , 2005, IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005..

[7]  Samuel Neves,et al.  On the performance of GPU public-key cryptography , 2011, ASAP 2011 - 22nd IEEE International Conference on Application-specific Systems, Architectures and Processors.

[8]  P. L. Montgomery Modular multiplication without trial division , 1985 .

[9]  Pradeep Dubey,et al.  Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU , 2010, ISCA.

[10]  Nigel P. Smart,et al.  Toward Acceleration of RSA Using 3D Graphics Hardware , 2007, IMACC.

[11]  Seungyeop Han,et al.  SSLShader: Cheap SSL Acceleration with Commodity Processors , 2011, NSDI.

[12]  Tim Güneysu,et al.  Exploiting the Power of GPUs for Asymmetric Cryptography , 2008, CHES.

[13]  Tolga Acar,et al.  Analyzing and comparing Montgomery multiplication algorithms , 1996, IEEE Micro.

[14]  John Waldron,et al.  Efficient Acceleration of Asymmetric Cryptography on Graphics Hardware , 2009, AFRICACRYPT.

[15]  Zhe Liu,et al.  High-Speed Elliptic Curve Cryptography on the NVIDIA GT200 Graphics Processing Unit , 2014, ISPEC.

[16]  Yuan Zhao,et al.  Exploiting the Floating-Point Computing Power of GPUs for RSA , 2014, ISC.

[17]  S.A. Manavski,et al.  CUDA Compatible GPU as an Efficient Hardware Accelerator for AES Cryptography , 2007, 2007 IEEE International Conference on Signal Processing and Communications.

[18]  Adi Shamir,et al.  A method for obtaining digital signatures and public-key cryptosystems , 1978, CACM.