Comparison of parallelized radix-2 and radix-4 scalable Montgomery multipliers

This paper compares 130 nm custom silicon implementations of three scalable Montgomery multiplier architectures to previously published FPGA implementations of the same architectures. It investigates the delay, energy, and area tradeoffs of parallelized left-shifting radix-2, radix-4, and Booth-encoded radix-4 architectures. The radix-4 architecture is the most efficient, performing a 256 × 256-bit modular multiplication in 453 ns while consuming 15.7 nJ of energy and occupying an area of 0.141 mm². The radix-2 architecture is a close second, with an energy-delay product (EDP) 0.8% higher and an area-delay product (ADP) 3.1% higher. The Booth-encoded radix-4 architecture eliminates the need for an adder that generates the 3× multiple, but its EDP is 36% higher and its ADP is 34% higher than those of the conventional radix-4 architecture. The relative efficiencies of the silicon implementations are consistent with those of the FPGA implementations.
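The efficiency metrics quoted above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, assuming EDP is the plain product of energy and delay and ADP the plain product of area and delay for one 256 × 256-bit multiplication. The function and variable names are illustrative and not taken from the paper. The abstract gives only relative figures for the radix-2 and Booth-encoded designs, so their absolute values here are derived from the radix-4 baseline rather than reported directly.

```python
# Figures quoted in the abstract for the radix-4 design (one 256 x 256-bit multiply).
DELAY_NS = 453.0
ENERGY_NJ = 15.7
AREA_MM2 = 0.141


def edp(energy_nj: float, delay_ns: float) -> float:
    """Energy-delay product in nJ*ns (assumed to be a plain product)."""
    return energy_nj * delay_ns


def adp(area_mm2: float, delay_ns: float) -> float:
    """Area-delay product in mm^2*ns (assumed to be a plain product)."""
    return area_mm2 * delay_ns


def scaled(baseline: float, percent_higher: float) -> float:
    """Value that is `percent_higher` percent above `baseline`."""
    return baseline * (1.0 + percent_higher / 100.0)


radix4_edp = edp(ENERGY_NJ, DELAY_NS)
radix4_adp = adp(AREA_MM2, DELAY_NS)

# Relative overheads stated in the abstract, applied to the radix-4 baseline.
radix2_edp = scaled(radix4_edp, 0.8)
radix2_adp = scaled(radix4_adp, 3.1)
booth4_edp = scaled(radix4_edp, 36.0)
booth4_adp = scaled(radix4_adp, 34.0)

if __name__ == "__main__":
    rows = [
        ("radix-4", radix4_edp, radix4_adp),
        ("radix-2", radix2_edp, radix2_adp),
        ("Booth radix-4", booth4_edp, booth4_adp),
    ]
    for name, e, a in rows:
        print(f"{name:14s} EDP = {e:8.1f} nJ*ns   ADP = {a:6.2f} mm^2*ns")
```

Under these assumptions the radix-4 baseline works out to an EDP of about 7112 nJ·ns and an ADP of about 63.9 mm²·ns; the other two rows simply scale those values by the stated percentages.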
