Relational Query Co-Processing on Graphics Processors

Graphics processors (GPUs) have recently emerged as powerful co-processors for general-purpose computation. Compared with commodity CPUs, GPUs have an order of magnitude higher computation power as well as memory bandwidth. Moreover, new-generation GPUs allow writes to random memory locations, provide efficient inter-processor communication through on-chip local memory, and support a general-purpose parallel programming model. Nevertheless, many of the GPU features are specialized for graphics processing, including the massively multi-threaded architecture, the Single-Instruction-Multiple-Data processing style, and the execution model of a single application at a time. Additionally, GPUs rely on a bus of limited bandwidth to transfer data to and from the CPU, do not allow dynamic memory allocation from GPU kernels, and have little hardware support for write conflicts. Therefore, utilizing the GPU for co-processing database queries requires careful design and implementation. In this paper, we present our design, implementation, and evaluation of an in-memory relational query co-processing system, GDB, on the GPU. Taking advantage of the GPU hardware features, we design a set of highly optimized data-parallel primitives such as split and sort, and use these primitives to implement common relational query processing algorithms. Our algorithms utilize the high parallelism as well as the high memory bandwidth of the GPU, and use parallel computation and memory optimizations to effectively reduce memory stalls. Furthermore, we propose co-processing techniques that take into account both the computation resources and the GPU-CPU data transfer cost so that each operator in a query can utilize suitable processors (the CPU, the GPU, or both) for an optimized overall performance. We have evaluated our GDB system on a machine with an Intel quad-core CPU and an NVIDIA GeForce 8800 GTX GPU. Our workloads include microbenchmark queries on memory-resident data as well as TPC-H queries that involve complex data types and multiple query operators on data sets larger than the GPU memory. Our results show that our GPU-based algorithms are 2-27x faster than their optimized CPU-based counterparts on in-memory data. Moreover, the performance of our co-processing scheme is similar to or better than both the GPU-only and the CPU-only schemes.
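The abstract names split as one of the data-parallel primitives and notes that GPUs have little hardware support for write conflicts. One common way to partition data without conflicting writes is a three-step pattern: per-thread histograms, a prefix sum over the counts, and a scatter to precomputed offsets. The sketch below simulates that general pattern sequentially in Python. It is not GDB's implementation; the function name `split`, its parameters, and the default modulo partitioning function are illustrative assumptions.

```python
from itertools import accumulate


def split(keys, num_partitions, num_threads=4, part_fn=None):
    """Stable partitioning in which every output slot has exactly one writer.

    Hypothetical sketch: each "thread" is simulated by a loop iteration.
    Returns the reordered keys and the start offset of each partition.
    """
    if part_fn is None:
        # Assumed default partitioning function (not from the paper).
        part_fn = lambda k: k % num_partitions

    n = len(keys)
    # Each simulated thread owns one contiguous chunk of the input.
    chunk = (n + num_threads - 1) // num_threads
    chunks = [range(t * chunk, min(n, (t + 1) * chunk))
              for t in range(num_threads)]

    # Step 1: every thread counts its own tuples per partition.
    hist = [[0] * num_partitions for _ in range(num_threads)]
    for t, idxs in enumerate(chunks):
        for i in idxs:
            hist[t][part_fn(keys[i])] += 1

    # Step 2: exclusive prefix sum over the counts in partition-major order,
    # giving each (partition, thread) pair a private output range.
    flat = [hist[t][p] for p in range(num_partitions)
            for t in range(num_threads)]
    offsets = [0] + list(accumulate(flat))[:-1]

    # Step 3: every thread scatters its tuples into its private ranges,
    # so no two threads ever write to the same location.
    out = [None] * n
    for t, idxs in enumerate(chunks):
        cursor = [offsets[p * num_threads + t] for p in range(num_partitions)]
        for i in idxs:
            p = part_fn(keys[i])
            out[cursor[p]] = keys[i]
            cursor[p] += 1

    starts = [offsets[p * num_threads] for p in range(num_partitions)]
    return out, starts


if __name__ == "__main__":
    result, starts = split([5, 2, 8, 3, 6, 1, 4, 7], num_partitions=2)
    print(result, starts)
```

Because the output ranges are fixed before any tuple is written, this pattern needs neither atomic operations nor dynamic memory allocation, the two constraints the abstract highlights. On a real GPU the three loops would map to separate kernels, or to phases of one kernel.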
