Ameliorating memory contention of OLAP operators on GPU processors

Implementations of database operators on GPU processors have shown dramatic performance improvements compared to multicore-CPU implementations. GPU threads can cooperate using shared memory, which is organized in interleaved banks and is fast only when threads read and modify addresses belonging to distinct memory banks. Therefore, data processing operators implemented on a GPU, in addition to contention caused by popular values, have to deal with a new performance-limiting factor: thread serialization when accessing values belonging to the same bank. Here, we define the problem of bank and value conflict optimization for data processing operators using the CUDA platform. To analyze the impact of these two factors on operator performance, we use two database operations: foreign-key join and grouped aggregation. We suggest and evaluate techniques for optimizing the data arrangement offline by creating clones of values to reduce overall memory contention. Results indicate that columns used for writes, such as grouping columns, need to be optimized to fully exploit the maximum bandwidth of shared memory.
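
The two sources of contention named above can be illustrated with a small model. The following Python sketch is not the authors' implementation. It assumes 32 word-interleaved banks (an assumption; the bank count depends on the GPU generation) and treats one group of 32 threads as issuing its shared-memory updates together. The names `bank_of`, `value_conflict_degree`, and `bank_conflict_degree` are hypothetical helpers introduced only for this illustration.

```python
from collections import Counter

NUM_BANKS = 32  # assumption: 32 word-interleaved shared-memory banks


def bank_of(word_index, num_banks=NUM_BANKS):
    """Bank that holds a given word under simple interleaving."""
    return word_index % num_banks


def value_conflict_degree(word_indices):
    """Largest number of threads that target the very same word."""
    return max(Counter(word_indices).values())


def bank_conflict_degree(word_indices, num_banks=NUM_BANKS):
    """Largest number of distinct words that map to a single bank."""
    per_bank = Counter(bank_of(w, num_banks) for w in set(word_indices))
    return max(per_bank.values())


# One word index per thread, for 32 threads.
stride_1 = [t for t in range(32)]        # each thread hits its own bank
stride_32 = [t * 32 for t in range(32)]  # distinct words, all in bank 0
hot_value = [7] * 32                     # every thread updates one popular word

# Cloning sketch: the popular word gets 4 clones in 4 different banks,
# and threads are spread across the clones round-robin.
cloned = [7 + (t % 4) for t in range(32)]
```

In this model, `stride_32` has a bank conflict degree of 32 although no two threads share a value, while `hot_value` has a value conflict degree of 32 within a single bank. Spreading the popular value over four clones lowers the value conflict degree from 32 to 8, at the cost of merging the clones after the operator finishes.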
