Efficient histogram-based range query estimation for dirty data

In recent years, data quality issues have attracted wide attention. Data quality problems are mainly caused by dirty data. Many methods for managing dirty data have been proposed, one of which is the entity-based relational database, in which each tuple represents an entity. Traditional query optimization techniques are not suitable for this entity-based model, so new ones need to be developed. In this paper, we propose a new histogram-based strategy for estimating query selectivity, focusing on the overestimation that traditional methods produce. We prove that our approaches are unbiased. Experimental results on both real and synthetic data sets show that our approaches give good estimates with low error.
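To make the overestimation problem concrete, the following is a minimal sketch of a conventional equi-width histogram estimator for range queries. It is not the estimator proposed in the paper. It assumes, purely for illustration, that an entity tuple may carry several candidate values for one attribute and that an entity matches a range query if any of its values falls in the range; the abstract does not spell out the model at this level. The function names, the toy data, and the bucket layout are all hypothetical.

```python
def build_equi_width_histogram(values, lo, hi, num_buckets):
    """Count values per bucket over [lo, hi) using equal-width buckets."""
    width = (hi - lo) / num_buckets
    counts = [0] * num_buckets
    for v in values:
        # Clamp so that a value equal to hi lands in the last bucket.
        idx = min(int((v - lo) / width), num_buckets - 1)
        counts[idx] += 1
    return counts, width


def estimate_range_count(counts, lo, width, a, b):
    """Estimate how many values lie in [a, b), assuming values are
    uniformly spread inside each bucket."""
    total = 0.0
    for i, count in enumerate(counts):
        bucket_lo = lo + i * width
        bucket_hi = bucket_lo + width
        overlap = max(0.0, min(b, bucket_hi) - max(a, bucket_lo))
        total += count * overlap / width
    return total


# Hypothetical entity-based relation: each inner list holds the candidate
# values of one attribute for a single entity (an assumption of this sketch).
entities = [[10, 12], [11], [30, 31, 32], [50]]

# A value-level histogram ignores which entity each value belongs to.
all_values = [v for entity in entities for v in entity]
counts, width = build_equi_width_histogram(all_values, lo=0, hi=60, num_buckets=6)

query_lo, query_hi = 10, 20
value_level_estimate = estimate_range_count(counts, 0, width, query_lo, query_hi)

# Ground truth at the entity level: an entity counts once, however many
# of its values fall inside the query range.
true_entity_count = sum(
    any(query_lo <= v < query_hi for v in entity) for entity in entities
)

print(value_level_estimate, true_entity_count)
```

In this toy setting the value-level histogram counts the first entity twice, so its estimate exceeds the number of matching entities. This is one plausible source of the overestimation the abstract refers to, under the assumptions stated above.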
