Revisiting POPCOUNT Operations in CPUs / GPUs

Popcount is a bitwise operation that takes a binary word as input and returns the number of set bits. It is a common building block for many applications, such as Hamming distance calculation. More recently, popcount has been used to approximate multiplications in convolutional neural networks. Given this renewed interest, this work asks: can programmers simply rely on the builtin popcount intrinsic, or is further tuning necessary to achieve peak performance? We benchmark several popcount implementations on both the CPU and the GPU and analyze their behavior under different working set sizes. On the CPU, results show that for memory-bound workloads the builtin popcount compiler intrinsic is within 0.01% of the fastest hand-tuned implementations, suggesting that no hand-tuning is required. In the compute-bound scenario, however, the gap grows to as much as 60%, and hand-tuned implementations of popcount matter.
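To make the operation concrete, the following is a minimal C sketch and not the implementations benchmarked in this work. It assumes a GCC- or Clang-compatible compiler for the `__builtin_popcountll` intrinsic. The helper names `popcount_swar`, `hamming_distance`, and `binary_dot` are hypothetical and chosen for illustration. The hand-written variant uses the well-known bit-parallel (SWAR) reduction. The dot product assumes vectors with entries in {-1, +1} packed one bit per entry, in which case the product of two entries is -1 exactly when their bits differ.

```c
#include <stddef.h>
#include <stdint.h>

/* Builtin intrinsic (assumes GCC or Clang). With suitable target flags
 * the compiler may lower this to a single hardware instruction. */
static inline unsigned popcount_builtin(uint64_t x) {
    return (unsigned)__builtin_popcountll(x);
}

/* Hand-written bit-parallel (SWAR) popcount: sum adjacent bits into
 * 2-bit fields, then 4-bit fields, then bytes, then add all bytes. */
static inline unsigned popcount_swar(uint64_t x) {
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    return (unsigned)((x * 0x0101010101010101ULL) >> 56);
}

/* Hamming distance between two bit vectors of n 64-bit words:
 * XOR marks the differing positions, popcount counts them. */
static uint64_t hamming_distance(const uint64_t *a, const uint64_t *b, size_t n) {
    uint64_t d = 0;
    for (size_t i = 0; i < n; i++) {
        d += popcount_builtin(a[i] ^ b[i]);
    }
    return d;
}

/* Dot product of two {-1, +1} vectors packed one bit per entry
 * (illustrative encoding: bit 1 = +1, bit 0 = -1). Matching bits
 * contribute +1 and differing bits contribute -1, so the result is
 * (total bits) - 2 * (number of differing bits). */
static int64_t binary_dot(const uint64_t *a, const uint64_t *b, size_t n) {
    int64_t bits = (int64_t)(64 * n);
    return bits - 2 * (int64_t)hamming_distance(a, b, n);
}
```

In this sketch the whole multiply-accumulate loop of a dot product collapses into one XOR and one popcount per 64 entries, which is why popcount throughput matters for binary neural networks.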
