Population Count on Intel® CPU, GPU and FPGA

Population count is a primitive used in many applications. Commodity processors have dedicated instructions for high-performance population count. Motivated by the productivity of high-level synthesis and the importance of population count, in this paper we investigate OpenCL implementations of population count algorithms and evaluate their performance and resource utilization on an FPGA. Based on the results, we select the most efficient implementation. We then derive a reduction pattern from a representative application of population count. We parallelize the reduction with atomic functions and optimize it with vectorized memory accesses, tree reduction, and compute-unit duplication. We evaluate the performance of the reduction kernel on an Intel® Xeon® CPU, an Intel® Iris™ Pro integrated GPU, and an FPGA card that features an Intel® Arria® 10 FPGA. When DRAM memory bandwidth is comparable across the three computing platforms, the FPGA achieves the highest kernel performance for large workloads. We also describe the performance bottlenecks on the FPGA. To make FPGAs more competitive in raw performance with high-performance CPU and GPU platforms, it is important to increase external memory bandwidth, minimize data movement between the host and the device, and reduce OpenCL runtime overhead on the FPGA.
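As an illustration of the primitive and of the reduction discussed above, the following minimal Python sketch shows two common ways to count the set bits of a 32-bit word, together with a sum over an array of words. It is not taken from the paper: the function names (`popcount_naive`, `popcount_swar`, `popcount_sum`) are hypothetical, the choice of a plain sum as the reduction is an assumption, and the OpenCL kernels evaluated in the paper may differ in both algorithm and structure.

```python
MASK32 = 0xFFFFFFFF  # Python ints are unbounded, so 32-bit width is emulated


def popcount_naive(x: int) -> int:
    """Count set bits one bit at a time (at most 32 loop iterations)."""
    x &= MASK32
    count = 0
    while x:
        count += x & 1
        x >>= 1
    return count


def popcount_swar(x: int) -> int:
    """Bit-parallel (SWAR) population count of a 32-bit word."""
    x &= MASK32
    x = x - ((x >> 1) & 0x55555555)                  # 2-bit partial sums
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333)   # 4-bit partial sums
    x = (x + (x >> 4)) & 0x0F0F0F0F                  # 8-bit partial sums
    return ((x * 0x01010101) & MASK32) >> 24         # add the four bytes


def popcount_sum(words) -> int:
    """Assumed reduction: total number of set bits over all words."""
    return sum(popcount_swar(w) for w in words)


if __name__ == "__main__":
    data = [0b1011, 0xFF, 0x0, 0xFFFFFFFF]
    print(popcount_sum(data))
```

The bit-parallel version uses a fixed number of operations regardless of the input, which is one reason variants of it are often considered for hardware implementations. In a parallel setting, each work-item would typically accumulate a partial sum before combining results, for example with an atomic add or a tree reduction.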
