Scaling the Construction of Wavelet Synopses for Maximum Error Metrics

Modern analytics involves computations over enormous numbers of data records. The volume of data and the stringent response-time requirements place increasing emphasis on the efficiency of approximate query processing. A major challenge in recent years has been the construction of synopses that provide a deterministic quality guarantee, often expressed in terms of a maximum error metric. Because it can approximate sharp discontinuities, wavelet decomposition has proved to be a very effective tool for data reduction. However, existing wavelet thresholding schemes that minimize maximum error metrics have complexities that are impractical for large datasets. Furthermore, they cannot efficiently handle the multi-dimensional version of the problem. To provide a practical solution, we develop parallel algorithms that take advantage of key properties of the wavelet decomposition and allocate tasks to multiple workers. To that end, we present (i) a general framework for the parallelization of existing dynamic programming (DP) algorithms, (ii) a parallel version of one such DP algorithm, and (iii) two highly efficient distributed greedy algorithms that can deal with data of arbitrary dimensionality. Our extensive experiments on both real and synthetic datasets over Hadoop show that the proposed algorithms achieve linear scalability and superior running-time performance compared to their centralized counterparts.
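To make the problem setting concrete, the following minimal Python sketch illustrates what a wavelet synopsis and its maximum error are in the one-dimensional case. It is not one of the algorithms proposed here: it only computes an unnormalized Haar decomposition, keeps a caller-chosen subset of coefficients, and measures the maximum absolute error of the resulting approximation. The function names (`haar_decompose`, `haar_reconstruct`, `synopsis_max_error`) and the sample data are assumptions made for illustration.

```python
def haar_decompose(data):
    """Unnormalized Haar transform of a power-of-two-length sequence.

    Returns [overall average, detail coefficients from coarsest to finest].
    """
    n = len(data)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    averages = list(data)
    details = []
    while len(averages) > 1:
        pairs = [(averages[i], averages[i + 1]) for i in range(0, len(averages), 2)]
        level_details = [(a - b) / 2 for a, b in pairs]
        averages = [(a + b) / 2 for a, b in pairs]
        # Coarser levels go in front of the finer ones computed earlier.
        details = level_details + details
    return averages + details


def haar_reconstruct(coeffs):
    """Invert haar_decompose."""
    values = coeffs[:1]
    pos = 1
    while pos < len(coeffs):
        level = coeffs[pos:pos + len(values)]
        expanded = []
        for avg, detail in zip(values, level):
            expanded.extend([avg + detail, avg - detail])
        pos += len(values)
        values = expanded
    return values


def synopsis_max_error(data, kept):
    """Maximum absolute error when only the coefficients at indices `kept` are retained."""
    coeffs = haar_decompose(data)
    sparse = [c if i in kept else 0.0 for i, c in enumerate(coeffs)]
    approx = haar_reconstruct(sparse)
    return max(abs(x - y) for x, y in zip(data, approx))


data = [2, 2, 0, 2, 3, 5, 4, 4]  # illustrative values only
print(haar_decompose(data))
print(synopsis_max_error(data, kept={0, 1}))
```

A thresholding scheme for maximum error metrics has to choose the retained set (here the hypothetical `kept` argument) so that this error is as small as possible for a given space budget. Doing that choice at scale is the part the parallel DP and distributed greedy algorithms address.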
