FPGA Implementation of SQUASH

Passive RFID tags are devices with very poor computational capability. How- ever an increasing number of applications require authentication of the tag. For this purpose, a simple solution is to use a challenge-response protocol. For exam- ple, the reader can send a random challenge R and the tag responds H(R + S), where S is a secret known by the reader and H is a hash function. Then, the reader can check whether the tag knows S. SQUASH, introduced by Adi Shamir in February 2008 [1], is a new hash function designed for this task. In this ar- ticle, we describe an FPGA implementation of this algorithm minimizing the resources. SQUASH is based on the one-way function c = m2 mod n coming from the Rabin cryptosystem [2]. To make it secure, the binary length of n must be at least 1000 bits long [8]. In [1], the author suggests using a 64-bit non-linear feedback shift register to generate m, a not yet factorized Mersenne's number (2x - 1) as modulus n, and to send out the bits of c without storing them. This process avoids storing three 1000-bit long numbers. The multiplications are achieved by on-the- y convolutions, sending each bit as soon as it is computed. Consequently, the only needed memory aims at storing the carry of the previous steps in the convolution. For the output, a window of 32 or 64 (or more) bits in c is used. It yields a hash function with inputs of 64 bits that is scalable in output. In this paper, we propose implementations designed in order to minimize the resources, possibly at the cost of an increased execution time. The target device is a Xilinx Virtex-4 XC4VLX200-10 FPGA. The algorithm recommended by Adi Shamir has 64 bits in input and 32 in output, and n = 21277 - 1. To reduce the hardware cost, we minimized the number of registers in the implementation data and control part. On the XC4VLX200, the design results require 377 slices. The full execution time to produce 32 bits is 63,250 clock cycles at 222 MHz, so we reach a throughput of 112,300 bits per second. We also implemented the algorithm with other size numbers. For 128 bits in input and 64 in output, we get 619 slices and 104,114 clock cycles at 206 MHz. In general, the length of the output in uences the execution time while the length of the input in uences the number of registers and slices.