Learning to accurately COUNT with query-driven predictive analytics

We study a novel solution for executing aggregation (specifically COUNT) queries over large-scale data. The solution is generally applicable: it can be deployed whether or not data owners restrict access to their data and allow only "aggregation operators" to be executed over it. To achieve this, it relies on predictive analytics driven by queries and their results. We propose a machine learning (ML) framework for the task, which can also be adapted to other aggregates. We focus on the widely used set-cardinality (i.e., COUNT) aggregation operator, as it is fundamental both to internal data system optimisations and to aggregation-query analytics. We contribute a novel, query-driven ML model whose goals are to: (i) learn the query space (access patterns), (ii) associate (complex) aggregation queries with the cardinality of their results, and (iii) define query similarity and use it to predict the cardinality of the answer set of an ad-hoc incoming query. Our ML model incorporates incremental learning algorithms to maintain high prediction accuracy even when both the querying patterns and the underlying data change. The significance of our contribution is that it (i) is the only query-driven solution applicable to general environments, including those with restricted-access data, (ii) offers incremental learning that adjusts to arriving ad-hoc queries, which is well suited to big data analytics, and (iii) offers performance (in terms of prediction accuracy, prediction time, and memory requirements) superior to that of data-centric approaches. We provide a comprehensive performance evaluation of our model, assessing its sensitivity and its comparative advantages over acclaimed data-centric methods (self-tuning histograms, sampling, and multidimensional histograms).
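The three goals above (learn the query space, associate queries with result cardinalities, predict by query similarity) can be illustrated with a minimal sketch. It is not the model proposed in the paper, whose details the abstract does not give. It assumes that each query is encoded as a numeric vector (here a hypothetical 1-D range query written as `(lower, upper)`), that similarity is Euclidean distance, and that the query space is summarised by prototypes updated online. The class name `QueryDrivenCountEstimator` and the parameter `radius` are illustrative assumptions.

```python
import math


class QueryDrivenCountEstimator:
    """Illustrative sketch only: online query prototypes with local COUNT averages."""

    def __init__(self, radius):
        # Assumed hyperparameter: a query farther than `radius` from every
        # prototype starts a new prototype.
        self.radius = radius
        # Each prototype is [query_vector, mean_count, number_of_updates].
        self.prototypes = []

    @staticmethod
    def _dist(a, b):
        # Euclidean distance, used here as the (assumed) query dissimilarity.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def _nearest(self, q):
        best, best_d = None, math.inf
        for p in self.prototypes:
            d = self._dist(p[0], q)
            if d < best_d:
                best, best_d = p, d
        return best, best_d

    def observe(self, q, count):
        """Incrementally learn from one executed query and its true COUNT."""
        q = list(q)
        proto, d = self._nearest(q)
        if proto is None or d > self.radius:
            self.prototypes.append([q, float(count), 1])
            return
        proto[2] += 1
        eta = 1.0 / proto[2]  # decreasing step size (running mean)
        # Move the winning prototype towards the query.
        proto[0] = [w + eta * (x - w) for w, x in zip(proto[0], q)]
        # Update the local cardinality estimate attached to that prototype.
        proto[1] += eta * (count - proto[1])

    def predict(self, q):
        """Predict COUNT for an unseen query from its most similar prototype."""
        proto, _ = self._nearest(list(q))
        if proto is None:
            raise ValueError("no training queries observed yet")
        return proto[1]


est = QueryDrivenCountEstimator(radius=0.5)
est.observe((0.0, 1.0), 100)   # query [0.0, 1.0] returned 100 rows
est.observe((0.1, 1.1), 120)   # a similar query, merged into the same prototype
est.observe((5.0, 6.0), 10)    # a dissimilar query, starts a new prototype
print(est.predict((0.05, 1.0)))  # 110.0
```

Note that prediction never touches the underlying data, only past queries and their results, which is what makes such an approach usable when access is restricted to aggregation operators. With the `1/n` step size each prototype stores an exact running mean; a constant step size would be one way to track drifting data, at the cost of noisier estimates.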
