Fast and Reliable Missing Data Contingency Analysis with Predicate-Constraints

Today, data analysts largely rely on intuition to determine whether missing or withheld rows of a dataset significantly affect their analyses. We propose a framework that automatically produces a contingency analysis, i.e., the range of values an aggregate SQL query could take, under formal constraints describing the variation and frequency of missing data tuples. We describe how to process SUM, COUNT, AVG, MIN, and MAX queries under these conditions, yielding hard error bounds based on testable constraints. We propose an optimization algorithm based on an integer program that reconciles a set of such constraints into these bounds, even when the constraints are overlapping, conflicting, or unsatisfiable. Our experiments on real-world datasets against several statistical imputation and inference baselines show that statistical techniques can have a deceptively high and often unpredictable error rate. In contrast, our framework offers hard bounds that are guaranteed to hold as long as the constraints are not violated. Despite providing hard bounds, our framework achieves accuracy competitive with the statistical baselines.
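To make the idea of a contingency range concrete, the following is a minimal sketch for the simplest case only: constraints that do not overlap, so every missing row falls under exactly one of them. It is not the integer-program algorithm described above, which also handles overlapping constraints. The names `Constraint`, `sum_bounds`, and `count_bounds`, as well as the field layout (a value range plus a row-count range per constraint), are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Constraint:
    """Hypothetical constraint on one disjoint group of missing rows."""
    lo: float        # smallest value a missing row in this group can take
    hi: float        # largest value a missing row in this group can take
    min_rows: int    # fewest rows that can be missing from this group
    max_rows: int    # most rows that can be missing from this group


def sum_bounds(observed_sum: float,
               constraints: Iterable[Constraint]) -> Tuple[float, float]:
    """Range of SUM over observed plus missing rows (disjoint groups only)."""
    lower = upper = observed_sum
    for c in constraints:
        # k * value is linear in the row count k, so the extremes occur
        # at one of the two row-count endpoints.
        lower += min(c.min_rows * c.lo, c.max_rows * c.lo)
        upper += max(c.min_rows * c.hi, c.max_rows * c.hi)
    return lower, upper


def count_bounds(observed_count: int,
                 constraints: Iterable[Constraint]) -> Tuple[int, int]:
    """Range of COUNT over observed plus missing rows (disjoint groups only)."""
    cs = list(constraints)
    return (observed_count + sum(c.min_rows for c in cs),
            observed_count + sum(c.max_rows for c in cs))


if __name__ == "__main__":
    missing = [
        Constraint(lo=0.0, hi=50.0, min_rows=0, max_rows=10),
        Constraint(lo=-5.0, hi=20.0, min_rows=2, max_rows=4),
    ]
    print(sum_bounds(1000.0, missing))   # (980.0, 1580.0)
    print(count_bounds(100, missing))    # (102, 114)
```

The bounds are hard in the same sense as in the abstract: they hold whenever the stated constraints hold, and say nothing if a constraint is violated.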
