Montgomery modular multiplication constitutes the “arithmetic foundation” of modern public-key cryptography, with applications ranging from RSA, DSA and Diffie-Hellman over elliptic curve schemes to pairing-based cryptosystems. The increased prevalence of SIMD-type instructions in commodity processors (e.g. Intel SSE, ARM NEON) has initiated a massive body of research on vector-parallel implementations of Montgomery modular multiplication. In this paper, we introduce the Cascade Operand Scanning (COS) method to speed up multi-precision multiplication on SIMD architectures. We developed the COS technique with the goal of reducing Read-After-Write (RAW) dependencies in the propagation of carries, which in turn reduces the number of pipeline stalls (i.e. bubbles). The COS method operates on 32-bit words in a row-wise fashion (similar to the operand-scanning method) and does not require a “non-canonical” representation of operands with a reduced radix. We show that two COS computations can be “coarsely” integrated into an efficient vectorized variant of Montgomery multiplication, which we call the Coarsely Integrated Cascade Operand Scanning (CICOS) method. Owing to our sophisticated instruction scheduling, the CICOS method reaches record-setting execution times for Montgomery modular multiplication on ARM NEON platforms. Detailed benchmarking results obtained on ARM Cortex-A9 and Cortex-A15 processors show that the proposed CICOS method outperforms Bos et al.'s implementation from SAC 2013 by up to 57% (A9) and 40% (A15), respectively.
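For orientation, the following is a minimal scalar sketch of word-level Montgomery multiplication on 32-bit words in the canonical (full-radix) representation, using the textbook coarsely integrated operand-scanning layout. It is not the COS or CICOS method and contains no SIMD code; the names `mont_mul`, `to_words` and `from_words` and the parameter `s` (number of words) are assumptions made purely for illustration. The inner loops show the carry chain, where each step reads the carry written by the previous one; this is the kind of RAW dependency the abstract refers to.

```python
W = 32                      # word size in bits (32-bit words, as in the abstract)
MASK = (1 << W) - 1

def to_words(x, s):
    """Split integer x into s little-endian 32-bit words."""
    return [(x >> (W * i)) & MASK for i in range(s)]

def from_words(words):
    """Recombine little-endian 32-bit words into an integer."""
    return sum(w << (W * i) for i, w in enumerate(words))

def mont_mul(a, b, n, s):
    """Return a * b * R^{-1} mod n, with R = 2^(32*s), n odd, and a, b < n.

    Hypothetical helper: a scalar, textbook formulation for illustration only.
    """
    n0_inv = (-pow(n, -1, 1 << W)) & MASK    # -n^{-1} mod 2^32 (Python 3.8+)
    aw, bw, nw = to_words(a, s), to_words(b, s), to_words(n, s)
    t = [0] * (s + 2)                        # running accumulator
    for i in range(s):
        # Row i of the product: t += a * b[i], with a sequential carry chain.
        carry = 0
        for j in range(s):
            acc = t[j] + aw[j] * bw[i] + carry
            t[j] = acc & MASK
            carry = acc >> W
        acc = t[s] + carry
        t[s] = acc & MASK
        t[s + 1] = acc >> W
        # Reduction step: add m * n so the lowest word becomes zero,
        # then shift the accumulator down by one word.
        m = (t[0] * n0_inv) & MASK
        carry = (t[0] + m * nw[0]) >> W
        for j in range(1, s):
            acc = t[j] + m * nw[j] + carry
            t[j - 1] = acc & MASK
            carry = acc >> W
        acc = t[s] + carry
        t[s - 1] = acc & MASK
        t[s] = t[s + 1] + (acc >> W)
    result = from_words(t[:s + 1])
    # Final conditional subtraction brings the result into [0, n).
    return result - n if result >= n else result
```

With `R = 1 << (32 * s)`, the result can be checked against `a * b * pow(R, -1, n) % n` for any odd `n` that fits in `s` words.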
[1] Shay Gueron, et al. Software Implementation of Modular Exponentiation, Using Advanced Vector Instructions Architectures. WAIFI, 2012.
[2] Paul Barrett, et al. Implementing the Rivest Shamir and Adleman Public Key Encryption Algorithm on a Standard Digital Signal Processor. CRYPTO, 1986.
[3] Francisco Rodríguez-Henríquez, et al. NEON Implementation of an Attribute-Based Encryption Scheme. ACNS, 2013.
[4] P. L. Montgomery. Modular multiplication without trial division. 1985.
[5] Patrick Longa, et al. Efficient and Secure Algorithms for GLV-Based Scalar Multiplication and Their Implementation on GLV-GLS Curves. CT-RSA, 2014.
[6] Marcelo E. Kaihara, et al. Montgomery Multiplication on the Cell. PPAM, 2009.
[7] Peter Schwabe, et al. NEON Crypto. CHES, 2012.
[8] Daniel Shumow, et al. Montgomery Multiplication Using Vector Instructions. Selected Areas in Cryptography, 2013.
[9] Ricardo Dahab, et al. Fast Software Polynomial Multiplication on ARM Processors Using the NEON Engine. CD-ARES Workshops, 2013.
[10] Patrick Schaumont, et al. Cryptographic Hardware and Embedded Systems – CHES 2012. Lecture Notes in Computer Science, 2012.
[11] Patrick Schaumont, et al. SIMD acceleration of modular arithmetic on contemporary embedded platforms. 2013 IEEE High Performance Extreme Computing Conference (HPEC), 2013.