Faster Population Counts Using AVX2 Instructions

Counting the number of ones in a binary stream is a common operation in database, information-retrieval, cryptographic and machine-learning applications. Most processors have dedicated instructions to count the number of ones in a word (e.g., popcnt on x64 processors). Perhaps surprisingly, we show that a vectorized approach using SIMD instructions can be twice as fast as the dedicated instructions on recent Intel processors. The benefits can be even greater for applications, such as similarity measures (e.g., the Jaccard index), that require additional Boolean operations. Our approach has been adopted by LLVM: it is used by its popular C compiler (Clang).
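For orientation, the scalar baseline that the abstract compares against can be sketched as follows. This is a minimal illustration, not the vectorized AVX2 method of the paper: it assumes a GCC- or Clang-compatible compiler providing `__builtin_popcountll` (which typically compiles to popcnt on x64 when that instruction is enabled), and the function names `popcount_words` and `jaccard_index` are hypothetical. The Jaccard routine shows why similarity measures need extra Boolean operations: each word pair requires an AND and an OR before counting.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count the ones in an array of n 64-bit words, one word at a time.
 * __builtin_popcountll is a GCC/Clang builtin (an assumption about the
 * toolchain, not something stated in the abstract). */
uint64_t popcount_words(const uint64_t *words, size_t n) {
    assert(words != NULL || n == 0);
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++) {
        total += (uint64_t)__builtin_popcountll(words[i]);
    }
    return total;
}

/* Jaccard index |A AND B| / |A OR B| of two bitsets of n words each.
 * Returning 1.0 for two empty sets is a convention chosen for this sketch. */
double jaccard_index(const uint64_t *a, const uint64_t *b, size_t n) {
    assert((a != NULL && b != NULL) || n == 0);
    uint64_t inter = 0;
    uint64_t uni = 0;
    for (size_t i = 0; i < n; i++) {
        inter += (uint64_t)__builtin_popcountll(a[i] & b[i]);
        uni += (uint64_t)__builtin_popcountll(a[i] | b[i]);
    }
    if (uni == 0) {
        return 1.0;
    }
    return (double)inter / (double)uni;
}
```

A vectorized version would replace the per-word builtin with SIMD operations over several words at once; the interface above would stay the same.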
