Auto-Validate: Unsupervised Data Validation Using Data-Domain Patterns Inferred from Data Lakes

Complex data pipelines are increasingly common in diverse applications such as BI reporting and ML modeling. These pipelines often recur regularly (e.g., daily or weekly), as BI reports need to be refreshed and ML models need to be retrained. However, it is widely reported that in complex production pipelines, upstream data feeds can change in unexpected ways, causing downstream applications to break silently in ways that are expensive to resolve. Data validation has thus become an important topic, as evidenced by notable recent efforts from Google and Amazon, where the objective is to catch data quality issues early as they arise in the pipelines. Our experience on production data suggests, however, that on string-valued data, these existing approaches yield high false-positive rates and frequently require human intervention. In this work, we develop a corpus-driven approach to auto-validate machine-generated data by inferring suitable data-validation "patterns" that accurately describe the underlying data domain, minimizing false positives while maximizing the number of data quality issues caught. Evaluations using production data from real data lakes suggest that Auto-Validate is substantially more effective than existing methods. Part of this technology ships as an Auto-Tag feature in Microsoft Azure Purview.
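To make the idea of a data-validation pattern concrete, the following is a minimal sketch and not the method evaluated in this work. It assumes a single, very coarse generalization (digit runs and letter runs become character classes, everything else is kept literally), and it assumes a simple corpus-based false-positive estimate: among corpus columns where the pattern matches at least one value, the fraction in which it fails to match every value. All function names (`generalize`, `infer_pattern`, `validate`, `estimated_fpr`) are hypothetical.

```python
import re

# Tokenizer for the assumed coarse generalization: digit runs, ASCII letter
# runs, and any other single character (kept as a literal).
TOKEN = re.compile(r"(?P<digits>[0-9]+)|(?P<letters>[A-Za-z]+)|(?P<other>.)", re.DOTALL)


def generalize(value):
    """Turn one string into a regex source string describing its shape."""
    parts = []
    for m in TOKEN.finditer(value):
        if m.lastgroup == "digits":
            parts.append("[0-9]+")
        elif m.lastgroup == "letters":
            parts.append("[A-Za-z]+")
        else:
            parts.append(re.escape(m.group()))
    return "".join(parts)


def infer_pattern(train_values):
    """Return a compiled pattern if all training values share one shape, else None."""
    shapes = {generalize(v) for v in train_values}
    if len(shapes) != 1:
        return None
    return re.compile(shapes.pop())


def validate(pattern, new_values):
    """Return the values in a future batch that the pattern does not fully match."""
    return [v for v in new_values if not pattern.fullmatch(v)]


def estimated_fpr(pattern, corpus_columns):
    """Assumed false-positive estimate over a corpus of columns (lists of strings).

    A column counts as relevant if the pattern matches at least one value, and
    as a likely false alarm if it does not match all of them.
    """
    relevant = 0
    impure = 0
    for column in corpus_columns:
        hits = [bool(pattern.fullmatch(v)) for v in column]
        if any(hits):
            relevant += 1
            if not all(hits):
                impure += 1
    return impure / relevant if relevant else 0.0
```

In this sketch, a pattern inferred from today's column values would be applied to the next recurrence of the pipeline with `validate`, and `estimated_fpr` would be used to prefer patterns that rarely raise alarms on columns from the same domain in the corpus. A pattern that is too narrow tends to score a high estimate, while one that is too general catches few issues, which is the trade-off the abstract describes.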
