sDPF-RSA: Utilizing Floating-point Computing Power of GPUs for Massive Digital Signature Computations

In financial, electronic and other security-sensitive industries, data centers rely on a range of protocols and algorithms to secure massive volumes of transactions. Digital signature computation is well known to be expensive and is a potential bottleneck that can restrict overall performance. In this paper, we make the following contributions. First, we propose a novel method called sDPF-RSA to accelerate Montgomery multiplication, the core algorithm of RSA, on Graphics Processing Units (GPUs). The sDPF approach uses the sign bit to increase the amount of information processed with each double-precision floating-point value, which considerably improves performance. Second, we have comprehensively reviewed and tested the algorithms to ensure that they all run in constant time. In particular, we improve the standard carry resolution algorithm by introducing two constant-time parallel techniques, thereby minimizing the potential for timing attacks against GPU-based RSA cryptosystems. Finally, we propose a full implementation of RSA that is optimized for our GPU-accelerated computing platform to make full use of its computing power. With protection against timing attacks, the throughputs of RSA-2048/3072/4096 on an NVIDIA GeForce GTX TITAN Black set a record of 52,747/15,179/6,435 operations per second for signature generation and 1,237,694/584,083/354,139 operations per second for signature verification with public key 65,537, with modest latency, outperforming the contemporaneous CPU and the many-core Xeon Phi processor by a factor of 3.9 to 11.
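For readers unfamiliar with Montgomery multiplication, the following is a minimal sketch of the textbook reduction on arbitrary-precision integers. It is not the sDPF method: it uses no floating-point limbs and no GPU parallelism, and it is not constant time. The function names `montgomery_setup`, `mont_mul` and `mont_pow` are illustrative assumptions, not names from the paper.

```python
# Textbook Montgomery multiplication (illustrative sketch, Python 3.8+).
# Assumption: n is odd and greater than 1, and R = 2**k with R > n.

def montgomery_setup(n, k):
    """Precompute R = 2**k and n_prime with n * n_prime = -1 (mod R)."""
    r = 1 << k
    n_prime = (-pow(n, -1, r)) % r
    return r, n_prime

def mont_mul(a, b, n, k, n_prime):
    """Return a * b * R^(-1) mod n for 0 <= a, b < n, without dividing by n."""
    mask = (1 << k) - 1
    t = a * b
    m = ((t & mask) * n_prime) & mask   # m = t * n_prime mod R
    u = (t + m * n) >> k                # exact division by R, result < 2n
    # Final conditional subtraction. This branch is NOT constant time;
    # a hardened implementation would replace it with a masked subtraction.
    return u - n if u >= n else u

def mont_pow(base, exp, n):
    """Left-to-right square-and-multiply in the Montgomery domain."""
    k = n.bit_length()
    r, n_prime = montgomery_setup(n, k)
    x = (base % n) * r % n              # base in Montgomery form
    acc = r % n                         # Montgomery form of 1
    for bit in bin(exp)[2:]:
        acc = mont_mul(acc, acc, n, k, n_prime)
        if bit == "1":
            acc = mont_mul(acc, x, n, k, n_prime)
    return mont_mul(acc, 1, n, k, n_prime)  # leave the Montgomery domain

if __name__ == "__main__":
    n, e, d = 3233, 17, 2753            # toy RSA key (p = 61, q = 53)
    signature = mont_pow(65, d, n)      # "sign" the message 65
    print(mont_pow(signature, e, n))    # verification recovers the message
```

The point of the reduction is that division by the modulus is replaced by masking and shifting by a power of two, which is the structure that limb-parallel GPU implementations build on.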
