Learning Set Cardinality in Distance Nearest Neighbours

Distance-based nearest-neighbour (dNN) queries, and aggregations over their answer sets, are important for exploratory data analytics. We focus on the Set Cardinality Prediction (SCP) problem: predicting the size of the answer set of a dNN query. We contribute a novel, query-driven perspective on this problem, whereby the answers to previous dNN queries are used to learn the answers to incoming ones. The proposed machine learning (ML) model learns the dynamically changing space of query patterns and can therefore focus only on the portion of the data actually being queried. The model offers comparative advantages in prediction error and space requirements. It is also applicable in environments with sensitive data or where data accesses are too costly to execute, settings in which the data-centric state of the art is inapplicable or too costly. A comprehensive performance evaluation compares our model against established methods: several self-tuning histograms, sampling, multidimensional histograms, and the power-method.
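To make the query-driven idea concrete, the following is a minimal illustrative sketch, not the model proposed in the paper. It assumes each dNN query is represented as a vector (centre coordinates plus radius) and that the true cardinality is observed after the query executes. The class name `QueryDrivenCardinalityEstimator`, the `vigilance` threshold, and the simple nearest-prototype update rule are all hypothetical choices made for illustration.

```python
import math


class QueryDrivenCardinalityEstimator:
    """Hypothetical sketch: learns only from (query, cardinality) pairs,
    never touching the underlying data."""

    def __init__(self, vigilance, learning_rate=0.1):
        self.vigilance = vigilance          # max distance at which a prototype is reused
        self.learning_rate = learning_rate  # step size of the online updates
        self.prototypes = []                # entries: [query_vector, mean_cardinality]

    @staticmethod
    def _as_vector(center, radius):
        # A dNN query is encoded as its centre followed by its radius.
        return [float(c) for c in center] + [float(radius)]

    def _nearest(self, q):
        # Linear scan for the closest stored query prototype.
        best, best_dist = None, math.inf
        for i, (w, _) in enumerate(self.prototypes):
            d = math.dist(w, q)
            if d < best_dist:
                best, best_dist = i, d
        return best, best_dist

    def observe(self, center, radius, cardinality):
        """Update the model with an executed query and its true answer size."""
        q = self._as_vector(center, radius)
        idx, dist = self._nearest(q)
        if idx is None or dist > self.vigilance:
            # Unfamiliar region of the query space: start a new prototype.
            self.prototypes.append([q, float(cardinality)])
            return
        w, y = self.prototypes[idx]
        eta = self.learning_rate
        # Move the prototype and its cardinality estimate towards the new pair.
        self.prototypes[idx][0] = [wi + eta * (qi - wi) for wi, qi in zip(w, q)]
        self.prototypes[idx][1] = y + eta * (cardinality - y)

    def predict(self, center, radius):
        """Predict the answer-set size without accessing any data."""
        idx, _ = self._nearest(self._as_vector(center, radius))
        return 0.0 if idx is None else self.prototypes[idx][1]
```

In this sketch the memory footprint grows with the number of distinct query patterns rather than with the data size, and prediction needs no data access, which loosely mirrors the space and privacy advantages claimed above. A typical use would call `observe` after each executed query and `predict` for each incoming one.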
