Paper Information - Dr. Top-k: Delegate-Centric Top-k on GPUs

Recent work on top-k computation has explored adapting various sorting algorithms to answer top-k queries on GPUs. These approaches, unfortunately, perform significantly more work than needed. This paper introduces Dr. Top-k, a delegate-centric top-k system on GPUs that significantly reduces the top-k workload. In particular, it makes three major contributions. First, we introduce a comprehensive design of the delegate-centric concept, including the maximum delegate, delegate-based filtering, and β delegate mechanisms, which together reduce the top-k workload by more than 99% in the best case. Second, because deriving a proper subrange size is both difficult and important, we perform a rigorous theoretical analysis, coupled with thorough experimental validation, to identify the desirable subrange size. Third, we introduce four key system optimizations to enable fast multi-GPU top-k computation. Taken together, this work consistently outperforms the state of the art.
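The abstract names the maximum delegate and delegate-based filtering mechanisms without spelling them out. The sequential Python sketch below shows one plausible reading of the idea, not the authors' GPU implementation: the input is split into fixed-size subranges, each subrange's maximum serves as its delegate, and only subranges whose delegates rank in the top k are searched further. The function name `delegate_top_k`, the `subrange_size` parameter, and the exact filtering rule are assumptions made for illustration; the β delegate mechanism and the multi-GPU optimizations are not modeled.

```python
import heapq


def delegate_top_k(data, k, subrange_size):
    """Return the k largest values of `data` in descending order.

    Illustrative sketch only: names and structure are hypothetical and
    are not taken from the Dr. Top-k source code.
    """
    if subrange_size <= 0:
        raise ValueError("subrange_size must be positive")
    if k <= 0 or not data:
        return []

    # Step 1 (maximum delegate): split the input into subranges and let
    # each subrange be represented by its maximum element.
    subranges = [data[i:i + subrange_size]
                 for i in range(0, len(data), subrange_size)]
    delegates = [(max(sub), idx) for idx, sub in enumerate(subranges)]

    # Step 2: run top-k on the (much smaller) delegate list.
    top_delegates = heapq.nlargest(k, delegates)

    # Step 3: only subranges owning a top-k delegate can contribute
    # top-k elements, so all other subranges are skipped.
    candidates = []
    for _, idx in top_delegates:
        candidates.extend(subranges[idx])

    # Step 4 (assumed form of delegate-based filtering): when k delegates
    # exist, the k-th largest delegate is a lower bound on the k-th
    # largest element, so smaller candidates can be dropped.
    if len(top_delegates) == k:
        threshold = top_delegates[-1][0]
        candidates = [x for x in candidates if x >= threshold]

    # Step 5: final top-k on the reduced candidate set.
    return heapq.nlargest(k, candidates)
```

For example, `delegate_top_k([5, 1, 9, 3, 7, 2, 8, 6, 4, 0], 2, 4)` inspects only the two subranges whose maxima are 9 and 8 and skips the third entirely. The choice of `subrange_size` trades delegate-list length against the number of candidates that survive, which is the trade-off the abstract's second contribution analyzes.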
