Efficient arithmetic on ARM-NEON and its application for high-speed RSA implementation

Modern processors support Single Instruction Multiple Data (SIMD) instructions (e.g. Intel-AVX, ARM-NEON), and a large body of research has been conducted on vector-parallel implementations of modular arithmetic, a crucial component of modern public-key cryptography including RSA, ElGamal, DSA and ECC. In this paper, we introduce a novel Double Operand Scanning (DOS) method to speed up multi-precision squaring with non-redundant representations on SIMD architectures. The DOS technique partly doubles the operands and computes the squaring operation without Read-After-Write (RAW) dependencies between source and destination variables. Furthermore, we present Karatsuba Cascade Operand Scanning (KCOS) multiplication and Karatsuba Double Operand Scanning (KDOS) squaring, which adopt the additive and subtractive Karatsuba methods, respectively. The proposed multiplication and squaring methods are compatible with separated Montgomery algorithms and are highly efficient for the RSA cryptosystem. Finally, our proposed multiplication/squaring, separated Montgomery multiplication/squaring and RSA encryption outperform the best-known results by 22%/41%, 25%/33% and 30%, respectively, on the Cortex-A15 platform.
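The two arithmetic ideas named in the abstract, doubling an operand so that each cross product in a squaring is computed only once, and subtractive Karatsuba, can be illustrated with a minimal Python sketch. This is not the paper's DOS or KDOS implementation: it ignores NEON register allocation, instruction scheduling and RAW-dependency avoidance, and relies on Python's arbitrary-precision integers instead of explicit carry handling. The limb width `W` and all function names are assumptions chosen for illustration.

```python
W = 32                 # limb width in bits (assumed for illustration)
MASK = (1 << W) - 1

def to_limbs(x, n):
    """Split a non-negative integer into n little-endian W-bit limbs."""
    return [(x >> (W * i)) & MASK for i in range(n)]

def from_limbs(limbs):
    """Recombine little-endian limbs (entries may exceed W bits) into an integer."""
    return sum(v << (W * i) for i, v in enumerate(limbs))

def square_doubled_operand(a):
    """Schoolbook squaring in which each cross product a[i]*a[j] (i < j)
    is computed once, against a pre-doubled copy of a[i]."""
    n = len(a)
    acc = [0] * (2 * n)                  # column accumulators; carries resolved at the end
    for i in range(n):
        acc[2 * i] += a[i] * a[i]        # diagonal term a[i]^2
        doubled = 2 * a[i]               # doubled operand (needs W+1 bits)
        for j in range(i + 1, n):
            acc[i + j] += doubled * a[j] # covers both a[i]*a[j] and a[j]*a[i]
    return from_limbs(acc)

def karatsuba_square_subtractive(x, n):
    """One level of subtractive Karatsuba squaring of an n-limb value (n even).
    Uses 2*lo*hi = lo^2 + hi^2 - (hi - lo)^2, so only three half-size squarings."""
    half = W * (n // 2)
    lo, hi = x & ((1 << half) - 1), x >> half
    low_sq = square_doubled_operand(to_limbs(lo, n // 2))
    high_sq = square_doubled_operand(to_limbs(hi, n // 2))
    diff_sq = square_doubled_operand(to_limbs(abs(hi - lo), n // 2))
    middle = low_sq + high_sq - diff_sq  # equals 2 * lo * hi
    return low_sq + (middle << half) + (high_sq << (2 * half))
```

Both functions can be checked against Python's built-in multiplication, for example `karatsuba_square_subtractive(x, 16) == x * x` for any `x` below `2**512`. In a fixed-width implementation the doubled operand would overflow a `W`-bit word by one bit, which is one reason a real design has to treat the doubling step with care; the sketch sidesteps this entirely.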
