Multidimensional range queries on modern hardware

Range queries over multidimensional data are an important part of database workloads in many applications. Their execution may be accelerated by using multidimensional index structures (MDIS), such as kd-trees or R-trees. As for most index structures, the usefulness of this approach depends on the selectivity of the queries, and common wisdom told that a simple scan beats MDIS for queries accessing more than 15%-20% of a dataset. However, this wisdom is largely based on evaluations that are almost two decades old, performed on data being held on disks, applying IO-optimized data structures, and using single-core systems. The question is whether this rule of thumb still holds when multidimensional range queries (MDRQ) are performed on modern architectures with large main memories holding all data, multi-core CPUs and data-parallel instruction sets. In this paper, we study the question whether and how much modern hardware influences the performance ratio between index structures and scans for MDRQ. To this end, we conservatively adapted three popular MDIS, namely the R*-tree, the kd-tree, and the VA-file, to exploit features of modern servers and compared their performance to different flavors of parallel scans using multiple (synthetic and real-world) analytical workloads over multiple (synthetic and real-world) datasets of varying size, dimensionality, and skew. We find that all approaches benefit considerably from using main memory and parallelization, yet to varying degrees. Our evaluation indicates that, on current machines, scanning should be favored over parallel versions of classical MDIS even for very selective queries.

[1]  Nick Roussopoulos,et al.  Cubetree: Organization of and Bulk Updates on the Data Cube , 1997, SIGMOD Conference.

[2]  M. King,et al.  Evolution at two levels in humans and chimpanzees. , 1975, Science.

[3]  Christian Böhm,et al.  Fast parallel similarity search in multimedia databases , 1997, SIGMOD '97.

[4]  Kenny Q. Ye,et al.  An integrated map of genetic variation from 1,092 human genomes , 2012, Nature.

[5]  Christian Böhm,et al.  Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases , 2001, CSUR.

[6]  Hans-Peter Kriegel,et al.  The R*-tree: an efficient and robust access method for points and rectangles , 1990, SIGMOD '90.

[7]  Young-Jin Kim,et al.  Multi-dimensional range queries in sensor networks , 2003, SenSys '03.

[8]  Maria E. Orlowska,et al.  Range queries in dynamic OLAP data cubes , 2000, Data Knowl. Eng..

[9]  J. T. Robinson,et al.  The K-D-B-tree: a search structure for large multidimensional dynamic indexes , 1981, SIGMOD '81.

[10]  Jignesh M. Patel,et al.  BitWeaving: fast scans for main memory data processing , 2013, SIGMOD '13.

[11]  Christos Faloutsos,et al.  Declustering Spatial Databases on a Multi-Computer Architecture , 1996, EDBT.

[12]  Hans-Jörg Schek,et al.  Interactive-Time Similarity Search for Large Image Collections Using Parallel VA-Files , 2000, ECDL.

[13]  Satyanarayana R. Valluri,et al.  Query Optimization in Oracle 12c Database In-Memory , 2015, Proc. VLDB Endow..

[14]  Jon Louis Bentley,et al.  Quad trees a data structure for retrieval on composite keys , 1974, Acta Informatica.

[15]  Christos Faloutsos,et al.  Beyond uniformity and independence: analysis of R-trees using the concept of fractal dimension , 1994, PODS.

[16]  Christian Böhm,et al.  Optimal Multidimensional Query Processing Using Tree Striping , 2000, DaWaK.

[17]  Jon Louis Bentley,et al.  Multidimensional binary search trees used for associative searching , 1975, CACM.

[18]  Hans-Jörg Schek,et al.  A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces , 1998, VLDB.

[19]  Jürg Nievergelt,et al.  The Grid File: An Adaptable, Symmetric Multikey File Structure , 1984, TODS.

[20]  Daniel J. Abadi,et al.  Integrating compression and execution in column-oriented database systems , 2006, SIGMOD Conference.

[21]  Alfons Kemper,et al.  HyPer: A hybrid OLTP&OLAP main memory database system based on virtual memory snapshots , 2011, 2011 IEEE 27th International Conference on Data Engineering.

[22]  Heng Li,et al.  Tabix: fast retrieval of sequence features from generic TAB-delimited files , 2011, Bioinform..

[23]  Kihong Kim,et al.  Optimizing multidimensional index trees for main memory access , 2001, SIGMOD '01.

[24]  Antonin Guttman,et al.  R-trees: a dynamic index structure for spatial searching , 1984, SIGMOD '84.

[25]  Kenneth A. Ross,et al.  Implementing database operations using SIMD instructions , 2002, SIGMOD '02.

[26]  Nimrod Megiddo,et al.  Range queries in OLAP data cubes , 1997, SIGMOD '97.

[27]  Helga Thorvaldsdóttir,et al.  Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration , 2012, Briefings Bioinform..

[28]  Sudipta Sengupta,et al.  The Bw-Tree: A B-tree for new hardware platforms , 2013, 2013 IEEE 29th International Conference on Data Engineering (ICDE).

[29]  Michael Stonebraker,et al.  C-Store: A Column-oriented DBMS , 2005, VLDB.

[30]  Dimitrios Gunopulos,et al.  Selectivity estimators for multidimensional range queries over real attributes , 2005, The VLDB Journal.

[31]  A. Lièvre,et al.  KRAS mutation status is predictive of response to cetuximab therapy in colorectal cancer. , 2006, Cancer research.

[32]  Donald D. Chamberlin,et al.  Access Path Selection in a Relational Database Management System , 1989 .

[33]  Norman May,et al.  The SAP HANA Database -- An Architecture Overview , 2012, IEEE Data Eng. Bull..

[34]  Moira C. Norrie,et al.  The PH-tree: a space-efficient storage structure and multi-dimensional index , 2014, SIGMOD Conference.

[35]  Scott T. Leutenegger,et al.  Master-client R-trees: a new parallel R-tree architecture , 1999, Proceedings. Eleventh International Conference on Scientific and Statistical Database Management.

[36]  Beng Chin Ooi,et al.  Fast and Adaptive Indexing of Multi-Dimensional Observational Data , 2016, Proc. VLDB Endow..

[37]  Oliver Günther,et al.  Multidimensional access methods , 1998, CSUR.

[38]  Gabor T. Marth,et al.  A global reference for human genetic variation , 2015, Nature.

[39]  Rong Chen,et al.  Integrating 400 million variants from 80,000 human samples with extensive annotations: towards a knowledge base to analyze disease cohorts , 2016, BMC Bioinformatics.

[40]  Ira Assent,et al.  Evaluating Clustering in Subspace Projections of High Dimensional Data , 2009, Proc. VLDB Endow..

[41]  International Human Genome Sequencing Consortium Finishing the euchromatic sequence of the human genome , 2004 .

[42]  Bernd-Uwe Pagel,et al.  Towards an analysis of range query performance in spatial data structures , 1993, PODS '93.

[43]  Kenneth A. Ross,et al.  Rethinking SIMD Vectorization for In-Memory Databases , 2015, SIGMOD Conference.

[44]  Nick Roussopoulos,et al.  Cubetree: organization of and bulk incremental updates on the data cube , 1997, SIGMOD '97.