Dr. Top-k: Delegate-Centric Top-k on GPUs

Recent work on top-k computation adapts various sorting algorithms to answer top-k queries on GPUs. These approaches, unfortunately, perform significantly more work than necessary. This paper introduces Dr. Top-k, a delegate-centric top-k system on GPUs that significantly reduces the top-k workload. Specifically, it makes three major contributions. First, we introduce a comprehensive design of the delegate-centric concept, including maximum delegate, delegate-based filtering, and β delegate mechanisms, which together reduce the top-k workload by more than 99% in the best cases. Second, because deriving a proper subrange size is both difficult and important, we perform a rigorous theoretical analysis, coupled with thorough experimental validation, to identify the desirable subrange size. Third, we introduce four key system optimizations to enable fast multi-GPU top-k computation. Taken together, this work consistently outperforms the state-of-the-art.
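
To make the delegate idea concrete, the following is a minimal CPU-side sketch in Python of one plausible reading of maximum delegates combined with delegate-based filtering. It is not the authors' GPU implementation: the function name `delegate_topk`, the parameter `subrange_size`, and the use of `heapq.nlargest` as the inner top-k routine are all assumptions made for illustration, and the β delegate mechanism and multi-GPU optimizations are not modeled.

```python
import heapq
import random


def delegate_topk(values, k, subrange_size):
    """Return the k largest values in descending order (hypothetical sketch).

    Assumed scheme: the maximum of each subrange acts as its delegate, and
    only subranges whose delegate ranks in the top-k of all delegates are
    searched further.
    """
    if k <= 0:
        return []

    # Step 1: split the input into fixed-size subranges and take each
    # subrange's maximum as its delegate (paired with the subrange index).
    subranges = [values[i:i + subrange_size]
                 for i in range(0, len(values), subrange_size)]
    delegates = [(max(sub), idx) for idx, sub in enumerate(subranges)]

    # Step 2: first top-k, over the delegates only.
    top_delegates = heapq.nlargest(k, delegates)

    # Step 3: delegate-based filtering. Any subrange whose delegate missed
    # the top-k is dominated by k delegates that are at least as large, so
    # none of its elements is needed for the final answer.
    candidates = []
    for _, idx in top_delegates:
        candidates.extend(subranges[idx])

    # Step 4: second top-k, over the surviving candidates.
    return heapq.nlargest(k, candidates)


if __name__ == "__main__":
    random.seed(0)
    data = [random.randint(0, 10**9) for _ in range(1 << 16)]
    result = delegate_topk(data, k=32, subrange_size=256)
    # Cross-check against a full sort of the input.
    print(result == sorted(data, reverse=True)[:32])
```

Under these assumptions, the two top-k passes touch roughly n / subrange_size delegates plus at most k × subrange_size candidates instead of all n elements. This trade-off between the two terms is why the choice of subrange size matters.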
