PolyFit: Polynomial-based Indexing Approach for Fast Approximate Range Aggregate Queries

Range aggregate queries are widely used in data analytics. In some use cases, approximate results are preferred over exact results if they can be computed rapidly and satisfy approximation guarantees. Inspired by a recent indexing approach, we provide a means of representing a discrete point data set by continuous functions that can then serve as compact index structures. More specifically, we develop a polynomial-based indexing approach, called PolyFit, for processing approximate range aggregate queries. PolyFit supports multiple types of range aggregate queries, including COUNT, SUM, MIN and MAX aggregates, with guaranteed absolute and relative error bounds. Experimental results show that PolyFit is faster, more accurate, and more compact than existing learned index structures.
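
The idea of replacing a discrete point set with continuous functions can be illustrated for COUNT queries with a minimal Python sketch. It is not the fitting procedure of PolyFit itself. It assumes sorted, distinct keys, uses degree-1 polynomials (line segments) drawn through segment endpoints, and grows segments greedily. The names `build_segments`, `approx_cumulative`, `approx_count` and the parameter `delta` are hypothetical. Each segment approximates the cumulative count (the number of keys less than or equal to a given key) to within `delta` at every key it covers, so a range COUNT obtained as a difference of two evaluations is off by at most `2 * delta` when both query endpoints are keys in the data.

```python
from bisect import bisect_right


def build_segments(keys, delta):
    """Greedily cover sorted, distinct keys with linear pieces.

    Each piece is (start_key, end_key, slope, intercept) and deviates from the
    exact cumulative count by at most delta at every key it covers.
    Illustrative only: a simple quadratic-time greedy pass.
    """
    segments = []
    n = len(keys)
    start = 0
    while start < n:
        end = start
        # A single-key segment is the constant function "start + 1".
        best = (0.0, float(start + 1))
        while end + 1 < n:
            cand = end + 1
            # Line through the first and the candidate last point of the segment.
            slope = (cand - start) / (keys[cand] - keys[start])
            intercept = (start + 1) - slope * keys[start]
            within_bound = all(
                abs(slope * keys[i] + intercept - (i + 1)) <= delta
                for i in range(start, cand + 1)
            )
            if not within_bound:
                break
            end, best = cand, (slope, intercept)
        segments.append((keys[start], keys[end], best[0], best[1]))
        start = end + 1
    return segments


def approx_cumulative(segments, x):
    """Approximate the number of keys <= x using only the segments."""
    starts = [seg[0] for seg in segments]
    idx = bisect_right(starts, x) - 1
    if idx < 0:
        return 0.0  # x lies before the smallest key
    _, end_key, slope, intercept = segments[idx]
    # Clamp so that x in a gap between segments uses the segment's last key.
    return slope * min(x, end_key) + intercept


def approx_count(segments, lo, hi):
    """Approximate COUNT of keys in the half-open range (lo, hi]."""
    return approx_cumulative(segments, hi) - approx_cumulative(segments, lo)


if __name__ == "__main__":
    keys = [x * x for x in range(1, 101)]  # skewed toy data
    segments = build_segments(keys, delta=2.0)
    print(len(keys), "keys summarised by", len(segments), "segments")
    print("approx COUNT in (100, 2500]:", approx_count(segments, 100, 2500))
```

Only the segment coefficients are needed at query time, which is what makes such a structure compact. A larger `delta` yields fewer segments at the cost of a looser bound.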
