HippogriffDB: Balancing I/O and GPU Bandwidth in Big Data Analytics

As data sets grow and conventional processor performance scaling slows, data analytics is moving toward heterogeneous architectures that incorporate hardware accelerators (notably GPUs) to continue scaling performance. However, existing GPU-based databases fail to handle big data applications efficiently for two reasons: their execution models scale poorly on GPUs, whose memory capacity is limited, and they do not account for the discrepancy between fast GPUs and slow storage, which can negate the benefit of GPU acceleration. In this paper, we propose HippogriffDB, an efficient, scalable GPU-accelerated OLAP system. It tackles the bandwidth discrepancy using compression and an optimized data transfer path. HippogriffDB stores tables in a compressed format and uses the GPU for decompression, trading GPU cycles for improved I/O bandwidth. To improve data transfer efficiency, HippogriffDB introduces a peer-to-peer, multi-threaded data transfer mechanism that moves data directly from the SSD to the GPU. HippogriffDB adopts a query-over-block execution model that provides scalability through a stream-based approach; the model improves kernel efficiency with operator fusion and double buffering. We have implemented HippogriffDB using an NVMe SSD that communicates directly with a commercial GPU. Results on two popular benchmarks demonstrate its scalability and efficiency. HippogriffDB outperforms an existing GPU-based database (YDB) and an in-memory data analytics system (MonetDB) by 1-2 orders of magnitude.
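
The query-over-block idea described above can be illustrated with a minimal CPU-only sketch. This is not HippogriffDB's implementation: it assumes `zlib` as a stand-in for the GPU decompression step, a Python thread as a stand-in for the SSD-to-GPU transfer path, and a bounded queue of two slots as a rough approximation of double buffering. The names `compress_blocks`, `reader`, `fused_filter_sum`, `run_query`, and `BLOCK_ROWS` are hypothetical and chosen only for illustration.

```python
import queue
import struct
import threading
import zlib

BLOCK_ROWS = 4  # hypothetical block size (rows per block), for illustration only


def compress_blocks(values, block_rows=BLOCK_ROWS):
    """Split an integer column into fixed-size blocks and compress each block."""
    blocks = []
    for start in range(0, len(values), block_rows):
        chunk = values[start:start + block_rows]
        raw = struct.pack(f"<{len(chunk)}i", *chunk)  # little-endian int32
        blocks.append(zlib.compress(raw))
    return blocks


def reader(blocks, buffers):
    """Producer: stands in for the storage-to-accelerator transfer."""
    for block in blocks:
        buffers.put(block)  # waits while both queue slots are occupied
    buffers.put(None)  # sentinel: no more blocks


def fused_filter_sum(block, threshold):
    """Decompress, filter, and aggregate one block in a single pass.

    Fusing the steps avoids materializing an intermediate filtered column.
    """
    raw = zlib.decompress(block)
    values = struct.unpack(f"<{len(raw) // 4}i", raw)
    return sum(v for v in values if v > threshold)


def run_query(blocks, threshold):
    """Stream blocks through the fused operator while the reader runs ahead."""
    buffers = queue.Queue(maxsize=2)  # two slots: assumed double-buffer analogue
    producer = threading.Thread(target=reader, args=(blocks, buffers))
    producer.start()
    total = 0
    while True:
        block = buffers.get()
        if block is None:
            break
        total += fused_filter_sum(block, threshold)
    producer.join()
    return total


column = list(range(10))
blocks = compress_blocks(column)
result = run_query(blocks, threshold=5)
print(result)
```

Because each block is processed independently, the working set is bounded by the block size and the number of queue slots rather than by the table size, which is the property that lets a streaming model scale past device memory capacity.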
