Beyond the Wall: Near-Data Processing for Databases

The continuous growth of main memory size allows modern data systems to process entire large scale datasets in memory. The increase in memory capacity, however, is not matched by proportional decrease in memory latency, causing a mismatch for in-memory processing. As a result, data movement through the memory hierarchy is now one of the main performance bottlenecks for main memory data systems. Database systems researchers have proposed several innovative solutions to minimize data movement and to make data access patterns hardware-aware. Nevertheless, all relevant rows and columns for a given query have to be moved through the memory hierarchy; hence, movement of large data sets is on the critical path. In this paper, we present JAFAR, a Near-Data Processing (NDP) accelerator for pushing selects down to memory in modern column-stores. JAFAR implements the select operator and allows only qualifying data to travel up the memory hierarchy. Through a detailed simulation of JAFAR hardware we show that it has the potential to provide 9x improvement for selects in column-stores. In addition, we discuss both hardware and software challenges for using NDP in database systems as well as opportunities for further NDP accelerators to boost additional relational operators.

[1]  Martin L. Kersten,et al.  Database Architecture Optimized for the New Bottleneck: Memory Access , 1999, VLDB.

[2]  Jung Ho Ahn,et al.  NDA: Near-DRAM acceleration architecture leveraging commodity DRAM devices and standard memory modules , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[3]  Seth Copen Goldstein,et al.  PipeRench: A Reconfigurable Architecture and Compiler , 2000, Computer.

[4]  Gustavo Alonso,et al.  Ibex - An Intelligent Storage Engine with Support for Advanced SQL Off-loading , 2014, Proc. VLDB Endow..

[5]  Feifei Li,et al.  Comparing Implementations of Near-Data Computing with In-Memory MapReduce Workloads , 2014, IEEE Micro.

[6]  Babak Falsafi,et al.  Meet the walkers accelerating index traversals for in-memory databases , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[7]  Mihai F. Ionescu,et al.  Optimizing parallel bitonic sort , 1997, Proceedings 11th International Parallel Processing Symposium.

[8]  William J. Dally,et al.  Memory access scheduling , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[9]  Martin L. Kersten,et al.  MonetDB: Two Decades of Research in Column-oriented Database Architectures , 2012, IEEE Data Eng. Bull..

[10]  Kenneth A. Ross,et al.  Navigating big data with high-throughput, energy-efficient data partitioning , 2013, ISCA.

[11]  Jim Tørresen,et al.  FPGASort: a high performance sorting architecture exploiting run-time reconfiguration on fpgas for large problem sorting , 2011, FPGA '11.

[12]  Kenneth A. Ross,et al.  Making B+- trees cache conscious in main memory , 2000, SIGMOD '00.

[13]  José Francisco Martínez Trinidad,et al.  An FPGA-based parallel sorting architecture for the Burrows Wheeler transform , 2005, 2005 International Conference on Reconfigurable Computing and FPGAs (ReConFig'05).

[14]  Anastasia Ailamaki,et al.  Sharing Data and Work Across Concurrent Analytical Queries , 2013, Proc. VLDB Endow..

[15]  Gustavo Alonso,et al.  SharedDB: Killing One Thousand Queries With One Stone , 2012, Proc. VLDB Endow..

[16]  Somayeh Sardashti,et al.  The gem5 simulator , 2011, CARN.

[17]  Harold S. Stone,et al.  A Logic-in-Memory Computer , 1970, IEEE Transactions on Computers.

[18]  Onur Mutlu,et al.  Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[19]  Bruce Jacob,et al.  Memory Systems: Cache, DRAM, Disk , 2007 .

[20]  Gu-Yeon Wei,et al.  Aladdin: A pre-RTL, power-performance accelerator simulator enabling large design space exploration of customized architectures , 2014, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA).

[21]  David J. DeWitt,et al.  Weaving Relations for Cache Performance , 2001, VLDB.

[22]  James E. Smith,et al.  Fair Queuing Memory Systems , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[23]  William J. Dally,et al.  Smart Memories: a modular reconfigurable architecture , 2000, ISCA '00.

[24]  Eleni Petraki,et al.  Database cracking: fancy scan, not poor man's sort! , 2014, DaMoN '14.

[25]  Christoforos E. Kozyrakis,et al.  A case for intelligent RAM , 1997, IEEE Micro.

[26]  David J. DeWitt,et al.  Database Machines: An Idea Whose Time Passed? A Critique of the Future of Database Machines , 1989, IWDM.

[27]  Luigi Dadda,et al.  The design of a high speed ASIC unit for the hash function SHA-256 (384, 512) , 2004, Proceedings Design, Automation and Test in Europe Conference and Exhibition.

[28]  Engin Ipek,et al.  A resistive TCAM accelerator for data-intensive computing , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[29]  Gabriel Zachmann,et al.  GPU-ABiSort: optimal parallel sorting on stream architectures , 2006, Proceedings 20th IEEE International Parallel & Distributed Processing Symposium.

[30]  Gustavo Alonso,et al.  Main-memory hash joins on multi-core CPUs: Tuning to the underlying hardware , 2012, 2013 IEEE 29th International Conference on Data Engineering (ICDE).

[31]  J. Jeddeloh,et al.  Hybrid memory cube new DRAM architecture increases density and performance , 2012, 2012 Symposium on VLSI Technology (VLSIT).

[32]  Marcin Zukowski,et al.  Super-Scalar RAM-CPU Cache Compression , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[33]  Eleni Petraki,et al.  Holistic Indexing in Main-memory Column-stores , 2015, SIGMOD Conference.

[34]  Frederick Reiss,et al.  Main-memory scan sharing for multi-core CPUs , 2008, Proc. VLDB Endow..

[35]  Akashi Satoh,et al.  ASIC hardware focused comparison for hash functions MD5, RIPEMD-160, and SHS , 2005, International Conference on Information Technology: Coding and Computing (ITCC'05) - Volume II.

[36]  Duncan G. Elliott,et al.  Computational Ram: A Memory-simd Hybrid And Its Application To Dsp , 1992, 1992 Proceedings of the IEEE Custom Integrated Circuits Conference.

[37]  Harumi A. Kuno,et al.  Merging What's Cracked, Cracking What's Merged: Adaptive Indexing in Main-Memory Column-Stores , 2011, Proc. VLDB Endow..

[38]  Anastasia Ailamaki,et al.  ATraPos: Adaptive transaction processing on hardware Islands , 2014, 2014 IEEE 30th International Conference on Data Engineering.

[39]  Maya Gokhale,et al.  Processing in Memory: The Terasys Massively Parallel PIM Array , 1995, Computer.

[40]  Jeffrey F. Naughton,et al.  Cache Conscious Algorithms for Relational Query Processing , 1994, VLDB.

[41]  Dan Lin,et al.  SQRL: Hardware accelerator for collecting software data structures , 2014, 2014 23rd International Conference on Parallel Architecture and Compilation (PACT).

[42]  William H. Kautz,et al.  Cellular Logic-in-Memory Arrays , 1969, IEEE Transactions on Computers.

[43]  Marcin Zukowski,et al.  Cooperative Scans: Dynamic Bandwidth Sharing in a DBMS , 2007, VLDB.

[44]  Michael Garland,et al.  Designing efficient sorting algorithms for manycore GPUs , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.

[45]  Kenneth A. Ross,et al.  Q100: the architecture and design of a database processing unit , 2014, ASPLOS.

[46]  Alexander Zeier,et al.  SIMD-Scan: Ultra Fast in-Memory Table Scan using on-Chip Vector Processing Units , 2009, Proc. VLDB Endow..

[47]  Jaewook Shin,et al.  Mapping Irregular Applications to DIVA, a PIM-based Data-Intensive Architecture , 1999, ACM/IEEE SC 1999 Conference (SC'99).

[48]  Luigi Dadda,et al.  An ASIC design for a high speed implementation of the hash function SHA-256 (384, 512) , 2004, GLSVLSI '04.

[49]  Steven Swanson,et al.  Near-Data Processing: Insights from a MICRO-46 Workshop , 2014, IEEE Micro.

[50]  Jae-Gil Lee,et al.  Joins on Encoded and Partitioned Data , 2014, Proc. VLDB Endow..

[51]  Daniel J. Abadi,et al.  Integrating compression and execution in column-oriented database systems , 2006, SIGMOD Conference.

[52]  Tao Li,et al.  Informed Microarchitecture Design Space Exploration Using Workload Dynamics , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[53]  Karthikeyan Sankaralingam,et al.  DySER: Unifying Functionality and Parallelism Specialization for Energy-Efficient Computing , 2012, IEEE Micro.

[54]  Sally A. McKee,et al.  Hitting the memory wall: implications of the obvious , 1995, CARN.

[55]  Stamatis Vassiliadis,et al.  Cost-Efficient SHA Hardware Accelerators , 2008, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

[56]  Martin L. Kersten,et al.  Database Cracking , 2007, CIDR.

[57]  Mike Ignatowski,et al.  TOP-PIM: throughput-oriented programmable processing in memory , 2014, HPDC '14.

[58]  Amin Ansari,et al.  Bundled execution of recurring traces for energy-efficient general purpose processing , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).