Estimating the extent of the effects of Data Quality through Observations

Existing data quality works have so far focused on the computation of many data characteristics as a mean of quantifying different quality dimensions, like freshness, consistency, accuracy, or completeness, that are all defined about some ideal (clean) dataset. We claim that this approach falls short in providing a full specification of the quality of the data since it does not take into consideration the task for which the data is to be used, neither any future instances of the dataset. We argue that apart from the difference from the clean dataset, it is equally important to know the degree to which such difference affects the results of the task at hand. Thus, we extend the existing data quality definition to include that degree. Our approach, not only allows data quality to be considered in the context of the intended task, but can also provide useful information even in the absence of the clean dataset, and proffer an understanding of the effect of data quality in future dataset instances. We describe a system and its implementation that computes this extended form of data quality through a principled approach of systematic noise generation and task result evaluation. We perform numerous experiments illustrating the effectiveness of the approach and how this allows contextualizing traditional data quality measures.

[1]  Sushil Jajodia,et al.  Data Synthesis based on Generative Adversarial Networks , 2018, Proc. VLDB Endow..

[2]  C. Mallows,et al.  A Method for Comparing Two Hierarchical Clusterings , 1983 .

[3]  Paolo Papotti,et al.  Messing Up with BART: Error Generation for Evaluating Data-Cleaning Algorithms , 2015, Proc. VLDB Endow..

[4]  Carlo Batini,et al.  From Data Quality to Big Data Quality , 2015, J. Database Manag..

[5]  Shuai Ma,et al.  Improving Data Quality: Consistency and Accuracy , 2007, VLDB.

[6]  Jan Chomicki,et al.  Query Answering in Inconsistent Databases , 2003, Logics for Emerging Applications of Databases.

[7]  Felix Naumann,et al.  DFD: Efficient Functional Dependency Discovery , 2014, CIKM.

[8]  Felix Naumann,et al.  Discovering conditional inclusion dependencies , 2012, CIKM.

[9]  Tim Kraska,et al.  A sample-and-clean framework for fast and accurate query processing on dirty data , 2014, SIGMOD Conference.

[10]  Sanjay Krishnan,et al.  ActiveClean: Interactive Data Cleaning For Statistical Modeling , 2016, Proc. VLDB Endow..

[11]  Felix Naumann,et al.  Data Profiling with Metanome , 2015, Proc. VLDB Endow..

[12]  Wenfei Fan,et al.  Capturing missing tuples and missing values , 2010, PODS.

[13]  Divesh Srivastava,et al.  Data quality: The other face of Big Data , 2014, 2014 IEEE 30th International Conference on Data Engineering.

[14]  Jef Wijsen,et al.  Determining the Currency of Data , 2011, TODS.

[15]  Ihab F. Ilyas,et al.  Trends in Cleaning Relational Data: Consistency and Deduplication , 2015, Found. Trends Databases.

[16]  P. Cochat,et al.  Et al , 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.

[17]  Paolo Papotti,et al.  KATARA: A Data Cleaning System Powered by Knowledge Bases and Crowdsourcing , 2015, SIGMOD Conference.

[18]  C. Spearman The proof and measurement of association between two things. , 2015, International journal of epidemiology.

[19]  V. Barnett,et al.  Applied Linear Statistical Models , 1975 .

[20]  Diane M. Strong,et al.  Beyond Accuracy: What Data Quality Means to Data Consumers , 1996, J. Manag. Inf. Syst..

[21]  Tova Milo,et al.  Query-Oriented Data Cleaning with Oracles , 2015, SIGMOD Conference.

[22]  James Bailey,et al.  Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance , 2010, J. Mach. Learn. Res..

[23]  Michael Stonebraker,et al.  Detecting Data Errors: Where are we and what needs to be done? , 2016, Proc. VLDB Endow..

[24]  Felix Naumann,et al.  Profiling relational data: a survey , 2015, The VLDB Journal.

[25]  Wang Chiew Tan,et al.  STBenchmark: towards a benchmark for mapping systems , 2008, Proc. VLDB Endow..

[26]  Yangyong Zhu,et al.  The Challenges of Data Quality and Data Quality Assessment in the Big Data Era , 2015, Data Sci. J..

[27]  Renée J. Miller,et al.  Discovering data quality rules , 2008, Proc. VLDB Endow..