Fly, you fool! Faster Frodo for the ARM Cortex-M4

We present an efficient implementation of FrodoKEM-640 on an ARM Cortex-M4 core. We leverage the single instruction, multiple data paradigm, available in the instruction set of the ARM Cortex-M4, together with a careful analysis of the memory layout of matrices to considerably speed up matrix multiplications. Our implementations take up to 79.4% less cycles than the reference. Moreover, we challenge the usage of a cryptographically secure pseudorandom number generator for the generation of the large public matrix involved. We argue that statistically good pseudorandomness is enough to achieve the same security goal. Therefore, we propose to use xoshiro128∗∗ as a PRNG instead: its structure can be easily integrated in FrodoKEM-640, it passes all known statistical tests and greatly outperforms previous choices. By using xoshiro128∗∗ we improve the generation of the large public matrix, which is a considerable bottleneck for embedded devices, by up to 96%.