Lightweight Cardinality Estimation in LSM-based Systems

Data sources, such as social media, mobile apps and IoT sensors, generate billions of records each day. Keeping up with this influx of data while providing useful analytics to the users is a major challenge for today's data-intensive systems. A popular solution that allows such systems to handle rapidly incoming data is to rely on log-structured merge (LSM) storage models. LSM-based systems provide a tunable trade-off between ingesting vast amounts of data at a high rate and running efficient analytical queries on top of that data. For queries, it is well-known that the query processing performance largely depends on the ability to generate efficient execution plans. Previous research showed that OLAP query workloads rely on having small, yet precise, statistical summaries of the underlying data, which can drive the cost-based query optimization. In this paper we address the problem of computing data statistics for workloads with rapid data ingestion and propose a lightweight statistics-collection framework that exploits the properties of LSM storage. Our approach is designed to piggyback on the events (flush and merge) of the LSM lifecycle. This allows us to easily create an initial statistics and then keep them in sync with rapidly changing data while minimizing the overhead to the existing system. We have implemented and adapted well-known algorithms to produce various types of statistical synopses, including equi-width histograms, equi-height histograms, and wavelets. We performed an in-depth empirical evaluation that considers both the cardinality estimation accuracy and runtime overheads of collecting and using statistics. The experiments were conducted by prototyping our approach on top of Apache AsterixDB, an open source Big Data management system that has an entirely LSM-based storage backend.

[1]  Ion Stoica,et al.  BlinkDB: queries with bounded errors and bounded response times on very large data , 2012, EuroSys '13.

[2]  Benoît Dageville,et al.  Efficient and scalable statistics gathering for large databases in Oracle 11g , 2008, SIGMOD Conference.

[3]  Dimitris Sacharidis,et al.  Fast Approximate Wavelet Tracking on Streams , 2006, EDBT.

[4]  Patrick E. O'Neil,et al.  The log-structured merge-tree (LSM-tree) , 1996, Acta Informatica.

[5]  Wolfgang Lehner,et al.  Cardinality estimation using sample views with quality assurance , 2007, SIGMOD '07.

[6]  Sanjeev Khanna,et al.  Space-efficient online computation of quantile summaries , 2001, SIGMOD '01.

[7]  Jeffrey Scott Vitter,et al.  Wavelet-Based Cost Estimation for Spatial Queries , 2001, SSTD.

[8]  Martin Arlitt,et al.  A workload characterization study of the 1998 World Cup Web site , 2000, IEEE Netw..

[9]  Peter J. Haas,et al.  Maintaining bounded-size sample synopses of evolving datasets , 2008, The VLDB Journal.

[10]  S. Muthukrishnan,et al.  Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries , 2001, VLDB.

[11]  Patricia G. Selinger,et al.  Access path selection in a relational database management system , 1979, SIGMOD '79.

[12]  Volker Markl,et al.  LEO - DB2's LEarning Optimizer , 2001, VLDB.

[13]  Peter J. Haas,et al.  Synopses for Massive Data , 2012 .

[14]  Michael J. Carey,et al.  Data Ingestion in AsterixDB , 2015, EDBT.

[15]  Viktor Leis,et al.  How Good Are Query Optimizers, Really? , 2015, Proc. VLDB Endow..

[16]  Milind Bhandarkar,et al.  HAWQ: a massively parallel processing SQL engine in hadoop , 2014, SIGMOD Conference.

[17]  Surajit Chaudhuri,et al.  Self-tuning histograms: building histograms without looking at data , 1999, SIGMOD '99.

[18]  Peter J. Haas,et al.  Techniques for Warehousing of Sample Data , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[19]  Surajit Chaudhuri,et al.  Optimized stratified sampling for approximate query processing , 2007, TODS.

[20]  Kyuseok Shim,et al.  Approximate query processing using wavelets , 2001, The VLDB Journal.

[21]  Stavros Christodoulakis,et al.  On the propagation of errors in the size of join results , 1991, SIGMOD '91.

[22]  Pete Wyckoff,et al.  Hive - A Warehousing Solution Over a Map-Reduce Framework , 2009, Proc. VLDB Endow..

[23]  Martin Grund,et al.  Impala: A Modern, Open-Source SQL Engine for Hadoop , 2015, CIDR.

[24]  Wilson C. Hsieh,et al.  Bigtable: A Distributed Storage System for Structured Data , 2006, TOCS.

[25]  Jeffrey Scott Vitter,et al.  Wavelet-based histograms for selectivity estimation , 1998, SIGMOD '98.

[26]  Neeraj Kumar,et al.  SnappyData: A Hybrid Transactional Analytical Store Built On Spark , 2016, SIGMOD Conference.

[27]  Peter J. Haas,et al.  ISOMER: Consistent Histogram Construction Using Query Feedback , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[28]  Peter J. Haas,et al.  The New Jersey Data Reduction Report , 1997 .

[29]  Peter J. Haas,et al.  Improved histograms for selectivity estimation of range predicates , 1996, SIGMOD '96.

[30]  Chen Li,et al.  Storage Management in AsterixDB , 2014, Proc. VLDB Endow..

[31]  Sridhar Ramaswamy,et al.  The Aqua approximate query answering system , 1999, SIGMOD '99.

[32]  Peter J. Haas,et al.  Sequential sampling procedures for query size estimation , 1992, SIGMOD '92.

[33]  Yannis E. Ioannidis,et al.  Balancing histogram optimality and practicality for query result size estimation , 1995, SIGMOD '95.

[34]  Jeffrey Scott Vitter,et al.  Dynamic Maintenance of Wavelet-Based Histograms , 2000, VLDB.

[35]  David Salesin,et al.  Wavelets for computer graphics: a primer.1 , 1995, IEEE Computer Graphics and Applications.

[36]  Hai Wang,et al.  A multi-dimensional histogram for selectivity estimation and fast approximate query answering , 2003, CASCON.

[37]  Jeffrey Scott Vitter,et al.  Data cube approximation and histograms via wavelets , 1998, CIKM '98.

[38]  Chen Li,et al.  AsterixDB: A Scalable, Open Source BDMS , 2014, Proc. VLDB Endow..

[39]  Paul M. Aoki Algorithms for index-assisted selectivity estimation , 1999, Proceedings 15th International Conference on Data Engineering (Cat. No.99CB36337).

[40]  Surajit Chaudhuri,et al.  An overview of query optimization in relational systems , 1998, PODS.

[41]  Yossi Matias,et al.  Fast incremental maintenance of approximate histograms , 1997, TODS.

[42]  Luis Gravano,et al.  STHoles: a multidimensional workload-aware histogram , 2001, SIGMOD '01.