QTIB: Quick bit-reversed permutations on CPUs

We present a fast algorithm for out-of-place bit-reversed permutation of large vectors, as needed to order the input of an FFT. The algorithm extends two previously published methods and is tuned for advanced CPU hardware features: it makes heavy use of cache prefetching, the MMX and SSE units, and write-combining buffers. It has been implemented in assembly language for 2-byte and 4-byte operands. The method runs significantly faster than previously reported methods.
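For readers unfamiliar with the operation, the following is a minimal Python sketch of an out-of-place bit-reversed permutation. It is not the QTIB implementation, which is written in assembly and relies on prefetching, SIMD units, and write-combining buffers. The function names (`bit_reverse`, `bit_reversed_copy`, `bit_reversed_copy_tiled`) and the tile parameter `q` are hypothetical, chosen for illustration only. The tiled variant shows one generic cache-blocking idea, assumed here as an example: gather a small square tile from the source, then emit it in contiguous runs to the destination.

```python
def bit_reverse(value, bits):
    """Return `value` with its lowest `bits` bits in reversed order."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def _log2_exact(n):
    """Return log2(n), raising if n is not a positive power of two."""
    if n <= 0 or n & (n - 1):
        raise ValueError("length must be a positive power of two")
    return n.bit_length() - 1


def bit_reversed_copy(src):
    """Naive out-of-place permutation: dst[reverse(i)] = src[i]."""
    bits = _log2_exact(len(src))
    dst = [None] * len(src)
    for i, item in enumerate(src):
        dst[bit_reverse(i, bits)] = item
    return dst


def bit_reversed_copy_tiled(src, q=2):
    """Same result as bit_reversed_copy, processed in 2**q by 2**q tiles.

    An index is split as (a, b, c), where a and c have q bits each and b
    holds the remaining middle bits. Its reversal is (rev(c), rev(b), rev(a)).
    This tiling is an illustrative assumption, not the paper's exact scheme.
    """
    n = len(src)
    bits = _log2_exact(n)
    if bits < 2 * q:
        return bit_reversed_copy(src)  # vector too small to tile

    mid_bits = bits - 2 * q
    side = 1 << q
    rev_q = [bit_reverse(k, q) for k in range(side)]
    dst = [None] * n
    tile = [[None] * side for _ in range(side)]

    for b in range(1 << mid_bits):
        b_rev = bit_reverse(b, mid_bits)
        # Gather: each source row is contiguous in c; store it at row rev(a).
        for a in range(side):
            base = (a << (bits - q)) | (b << q)
            row = tile[rev_q[a]]
            for c in range(side):
                row[c] = src[base | c]
        # Scatter: each destination run is contiguous in rev(a).
        for c in range(side):
            base = (rev_q[c] << (bits - q)) | (b_rev << q)
            for a_rev in range(side):
                dst[base | a_rev] = tile[a_rev][c]
    return dst
```

Both functions return a new list and leave `src` untouched. In a low-level implementation the contiguous destination runs of the tiled variant are the kind of access pattern that write-combining buffers can exploit, although this sketch does not model any hardware behaviour.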
