Approximate Query Processing on High Dimensionality Database Tables Using Multidimensional Cluster Sampling View

Approximate query processing based on random sampling is one of the most useful methods for efficiently computing over large quantities of data kept in databases. However, small samples obtained through random sampling may lack data relevant to the query conditions, because such samples do not adequately represent the entire dataset. The Multidimensional Cluster Sampling View has been proposed to support efficient and effective approximate query processing on common database tables. This view allows random sample records to be drawn from a database efficiently and effectively using SQL. The effectiveness of approximate query processing with this view was previously demonstrated on a large database table with only four dimensions, whereas tables in decision support systems most commonly have more than ten. Further examination and evaluation focusing on dimensionality, using data with ten or more dimensions, is therefore required to demonstrate its practicality. This paper evaluates whether the number of dimensions has an impact on the approximation accuracy and on the performance of the Multidimensional Cluster Sampling View. The results of the evaluation show no visible effect of dimensionality on either.
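
The limitation of small random samples noted above can be illustrated with a minimal sketch. It uses plain uniform random sampling, not the Multidimensional Cluster Sampling View, and the table contents, the column names `region` and `amount`, and the helper `approximate_sum` are all hypothetical, chosen only for illustration. The estimate scales the sampled sum by the inverse sampling fraction; when the predicate is highly selective, very few sampled rows match, so the estimate can be far from the exact answer.

```python
import random


def approximate_sum(rows, predicate, value, sample_size, seed=0):
    """Estimate SUM(value) over rows matching predicate from a uniform sample.

    Returns the scaled estimate and the number of sampled rows that matched.
    """
    rng = random.Random(seed)
    sample = rng.sample(rows, sample_size)  # uniform, without replacement
    matched = [value(r) for r in sample if predicate(r)]
    scale = len(rows) / sample_size  # inverse sampling fraction
    return scale * sum(matched), len(matched)


# Hypothetical table: 100,000 rows, of which only 0.1% have region == "rare".
rows = [
    {"region": "rare" if i % 1000 == 0 else "common", "amount": 10}
    for i in range(100_000)
]


def is_rare(row):
    return row["region"] == "rare"


def amount(row):
    return row["amount"]


# Exact answer, computed by scanning every row.
exact = sum(amount(r) for r in rows if is_rare(r))

# Approximate answer from a 0.5% sample; few (possibly zero) rows match.
estimate, hits = approximate_sum(rows, is_rare, amount, sample_size=500)
print(f"exact={exact}, estimate={estimate}, matching sampled rows={hits}")
```

With a 0.5% sample, each matching sampled row stands in for 200 table rows, so a difference of a single hit shifts the estimate by 2,000 against an exact sum of 1,000. This is the kind of error that motivates sampling schemes that take the query dimensions into account.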
