Using the Integrated GPU to Improve CPU Sort Performance

In this paper we discuss the potential of the integrated GPU to accelerate sorting by performing a partial sort prior to a comparison-based CPU sort. We experiment with several comparison-based CPU sorting algorithms and outline the performance gain for a random input data set. We then analyze different x86 SoC architectures and show that, by sorting chunks stored inside the on-chip GPU memory, we can almost eliminate the impact the memory hierarchy has on performance. Finally, we discuss how our approach differs from previous designs: it is specifically tailored for an SoC with an integrated GPU and can improve on most of its known limitations (memory bandwidth, complexity, performance, etc.).
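The two-stage idea described above (a chunk-wise partial sort followed by a comparison-based CPU sort) can be illustrated with a minimal sketch. This is not the paper's implementation: the GPU pass is emulated on the CPU, and the names `presort_chunks`, `hybrid_sort` and the `chunk_size` parameter are hypothetical, chosen only for illustration. In a real system, `chunk_size` would presumably be picked so that a chunk fits in on-chip GPU memory.

```python
import random


def presort_chunks(data, chunk_size):
    """Sort each fixed-size chunk independently.

    Assumption: this stands in for the integrated-GPU pass, where each
    chunk would be sorted inside on-chip GPU memory. Here it runs on the CPU.
    """
    out = []
    for start in range(0, len(data), chunk_size):
        out.extend(sorted(data[start:start + chunk_size]))
    return out


def hybrid_sort(data, chunk_size=1024):
    """Partial sort by chunks, then a full comparison-based sort.

    Python's built-in sort (Timsort) detects already-sorted runs, so the
    final pass effectively merges the pre-sorted chunks.
    """
    partially_sorted = presort_chunks(data, chunk_size)
    return sorted(partially_sorted)


if __name__ == "__main__":
    random.seed(0)
    values = [random.randrange(1_000_000) for _ in range(10_000)]
    result = hybrid_sort(values, chunk_size=1024)
    print(result == sorted(values))
```

The sketch only shows the data flow. Whether the partial sort pays off depends on the final CPU algorithm being able to exploit pre-sorted runs and on the cost of the GPU pass, which this emulation does not model.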
