CFNTT: Scalable Radix-2/4 NTT Multiplication Architecture with an Efficient Conflict-free Memory Mapping Scheme

The number theoretic transform (NTT) is widely used to speed up polynomial multiplication, which is the critical computational bottleneck in many cryptographic algorithms, such as lattice-based post-quantum cryptography (PQC) and homomorphic encryption (HE). One trend in NTT hardware architecture is to support diverse security parameters and to meet resource constraints on different computing platforms. Flexibility and area-time product (ATP) therefore become two crucial metrics in NTT hardware design. Flexibility in terms of vector size and modulus can be obtained directly, whereas the varying strides in the memory access of in-place NTT make it difficult to design for different radices and different numbers of parallel butterfly units. This paper proposes an efficient conflict-free memory mapping scheme that supports configuration of both multiple parallel butterfly units and an arbitrary NTT radix. Compared to other approaches, this scheme has broader applicability and facilitates the parallelization of non-radix-2 NTT hardware designs. Based on this scheme, we propose a scalable radix-2 and radix-4 NTT multiplication architecture through algorithm-hardware co-design. A dedicated scheduling method reduces the number of modular additions/subtractions and modular multiplications in the radix-4 butterfly unit by 20% and 33%, respectively. To avoid the bit-reversal cost and reduce the memory footprint in arbitrary-radix NTT/INTT, we put forward a general method that rearranges the loop structure and reuses the twiddle factors. The hardware-level optimization exploits the symmetric operators in the radix-4 butterfly unit, which saves almost 50% of the hardware resources compared to a straightforward implementation. Through experimental results and theoretical analysis, we show that the radix-4 NTT outperforms the radix-2 NTT with the same number of parallel butterfly units in terms of area-time performance in an interleaved memory system. This advantage grows as the number of parallel butterfly units increases. For example, when processing a 1024-point, 14-bit NTT with 8 parallel butterfly units, the LUT/FF/DSP/BRAM ATP of the radix-4 NTT core is approximately 2.2×/1.2×/1.1×/1.9× lower than that of the radix-2 NTT core on a similar FPGA platform.
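
To make the memory-conflict problem concrete, the following Python sketch checks a classic digit-sum bank mapping for an in-place radix-r transform with a single butterfly unit and r memory banks. It is not the mapping scheme proposed in this paper, which also covers multiple parallel butterfly units; the function names (`bank_index`, `butterfly_operands`, `is_conflict_free`) and the address layout (operands of one butterfly differ only in one base-r digit) are assumptions made for illustration.

```python
def bank_index(addr, radix, num_digits):
    """Assumed mapping: bank = (sum of base-`radix` digits of addr) mod radix."""
    total = 0
    for _ in range(num_digits):
        total += addr % radix
        addr //= radix
    return total % radix


def butterfly_operands(n, radix, stage):
    """Yield the operand addresses of every butterfly in one in-place stage.

    In stage `stage` the operands of a butterfly differ only in base-`radix`
    digit number `stage`, so they are `radix ** stage` apart.
    """
    stride = radix ** stage
    for base in range(n):
        if (base // stride) % radix == 0:  # digit `stage` of base is zero
            yield [base + k * stride for k in range(radix)]


def is_conflict_free(n, radix):
    """Return True if no butterfly reads two operands from the same bank."""
    num_digits, size = 0, 1
    while size < n:
        size *= radix
        num_digits += 1
    if size != n:
        raise ValueError("n must be a power of radix")
    for stage in range(num_digits):
        for group in butterfly_operands(n, radix, stage):
            banks = {bank_index(a, radix, num_digits) for a in group}
            if len(banks) != radix:
                return False
    return True


if __name__ == "__main__":
    for radix in (2, 4):
        print(radix, is_conflict_free(1024, radix))
```

Because the operands of one butterfly take all r values in exactly one digit, their digit sums are distinct modulo r, so each operand lands in a different bank at every stage. Running several butterflies in parallel adds further constraints that this simple check does not model.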
