Efficient Data-Parallel Primitives on Heterogeneous Systems
[1] Volker Markl,et al. Hardware-Oblivious Parallelism for In-Memory Column-Stores , 2013, Proc. VLDB Endow..
[2] Karl Rupp,et al. Performance portability study of linear algebra kernels in OpenCL , 2014, IWOCL '14.
[3] Pradeep Dubey,et al. Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-Core CPUs , 2009, Proc. VLDB Endow..
[4] Bruce Merry,et al. A Performance Comparison of Sort and Scan Libraries for GPUs , 2015, Parallel Process. Lett..
[5] Naga K. Govindaraju,et al. Fast scan algorithms on graphics processors , 2008, ICS '08.
[6] Wolfgang Lehner,et al. Big data causing big (TLB) problems: taming random memory accesses on the GPU , 2017, DaMoN.
[7] Gustavo Alonso,et al. Main-memory hash joins on multi-core CPUs: Tuning to the underlying hardware , 2012, 2013 IEEE 29th International Conference on Data Engineering (ICDE).
[8] Jignesh M. Patel,et al. Design and evaluation of main memory hash join algorithms for multi-core CPUs , 2011, SIGMOD '11.
[9] W. Daniel Hillis,et al. Data parallel algorithms , 1986, CACM.
[10] Bingsheng He,et al. Efficient gather and scatter operations on graphics processors , 2007, Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07).
[11] Jens Dittrich,et al. On the Surprising Difficulty of Simple Things: the Case of Radix Partitioning , 2015, Proc. VLDB Endow..
[12] Jack Sklansky,et al. Conditional-Sum Addition Logic , 1960, IRE Trans. Electron. Comput..
[13] Qiong Luo,et al. Revisiting Multi-pass Scatter and Gather on GPUs , 2018, ICPP.
[14] Mark J. Harris,et al. Parallel Prefix Sum (Scan) with CUDA , 2011 .
[15] Bingsheng He,et al. Improving Main Memory Hash Joins on Intel Xeon Phi Processors: An Experimental Approach , 2015, Proc. VLDB Endow..
[16] Tilmann Rabl,et al. Generating custom code for efficient query execution on heterogeneous processors , 2017, The VLDB Journal.
[17] Shengen Yan,et al. StreamScan: fast scan algorithms for GPUs without global barrier synchronization , 2013, PPoPP '13.
[18] Kenneth A. Ross,et al. A comprehensive study of main-memory partitioning and its application to large-scale comparison- and radix-sort , 2014, SIGMOD Conference.
[19] Shubhabrata Sengupta,et al. Efficient Parallel Scan Algorithms for GPUs , 2011 .
[20] Samuel Madden,et al. Voodoo - A Vector Algebra for Portable Database Performance on Modern Hardware , 2016, Proc. VLDB Endow..
[21] Andrew S. Grimshaw,et al. Parallel Scan for Stream Architectures , 2012 .
[22] Guy E. Blelloch,et al. Scans as Primitive Parallel Operations , 1989, ICPP.
[23] Harold S. Stone,et al. A Parallel Algorithm for the Efficient Solution of a General Class of Recurrence Equations , 1973, IEEE Transactions on Computers.
[24] Bingsheng He,et al. Relational joins on graphics processors , 2008, SIGMOD Conference.
[25] Eitan M. Gurari,et al. Introduction to the theory of computation , 1989 .
[26] Yao Zhang,et al. Improving Performance Portability in OpenCL Programs , 2013, ISC.
[27] Jie Shen,et al. Performance Traps in OpenCL for CPUs , 2013, 2013 21st Euromicro International Conference on Parallel, Distributed, and Network-Based Processing.
[28] Jianbin Fang,et al. Grover: Looking for Performance Improvement by Disabling Local Memory Usage in OpenCL Kernels , 2014, 2014 43rd International Conference on Parallel Processing.
[29] Ulrich Meyer,et al. GPU Multisplit , 2017, ACM Trans. Parallel Comput..
[30] Pradeep Dubey,et al. Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort , 2010, SIGMOD Conference.
[31] Duane Merrill,et al. Single-pass Parallel Prefix Scan with Decoupled Lookback , 2016 .
[32] Kenneth A. Ross,et al. Rethinking SIMD Vectorization for In-Memory Databases , 2015, SIGMOD Conference.
[33] H. T. Kung,et al. A Regular Layout for Parallel Adders , 1982, IEEE Transactions on Computers.