Automating Large-Scale Data Quality Verification

Modern companies and institutions rely on data to guide every business process and decision. Missing or incorrect information seriously compromises any decision process downstream. Verifying the quality of data is therefore a crucial but tedious task for everyone involved in data processing. We present a system that automates the verification of data quality at scale and meets the requirements of production use cases. The system provides a declarative API that combines common quality constraints with user-defined validation code, thereby enabling 'unit tests' for data. We execute the resulting constraint validation workload efficiently by translating it to aggregation queries on Apache Spark. Our platform supports incremental validation of data quality on growing datasets and leverages machine learning, e.g., to enhance constraint suggestions, to estimate the 'predictability' of a column, and to detect anomalies in historic data quality time series. We discuss our design decisions, describe the resulting system architecture, and present an experimental evaluation on various datasets.
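
The idea of declaring 'unit tests' for data and evaluating them through aggregations can be illustrated with a minimal sketch. It is written in plain Python and is not the system's actual API, nor does it use Apache Spark; the names `State`, `Check`, `is_complete`, `is_unique`, `satisfies`, `scan`, and `evaluate` are hypothetical and chosen only for illustration. The sketch assumes that every constraint can be decided from a small per-column aggregation state, and that such states can be merged, so that a growing dataset is validated without rescanning earlier batches.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Iterable, List, Tuple

Row = Dict[str, Any]


@dataclass
class State:
    """Mergeable aggregation state for one column (hypothetical helper)."""
    rows: int = 0
    non_null: int = 0
    # Exact set of seen values: fine for a toy example, but an assumption
    # that would not scale to large data.
    values: set = field(default_factory=set)

    def add(self, value: Any) -> None:
        self.rows += 1
        if value is not None:
            self.non_null += 1
            self.values.add(value)

    def merge(self, other: "State") -> "State":
        # Combining two states gives the state of the concatenated data.
        return State(self.rows + other.rows,
                     self.non_null + other.non_null,
                     self.values | other.values)

    @property
    def completeness(self) -> float:
        return self.non_null / self.rows if self.rows else 1.0


class Check:
    """Declarative list of (description, column, predicate on State)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.constraints: List[Tuple[str, str, Callable[[State], bool]]] = []

    def is_complete(self, column: str) -> "Check":
        self.constraints.append(
            (f"is_complete({column})", column, lambda s: s.non_null == s.rows))
        return self

    def is_unique(self, column: str) -> "Check":
        self.constraints.append(
            (f"is_unique({column})", column,
             lambda s: len(s.values) == s.non_null))
        return self

    def satisfies(self, column: str, description: str,
                  predicate: Callable[[State], bool]) -> "Check":
        # User-defined validation code, expressed over the aggregated state.
        self.constraints.append((description, column, predicate))
        return self


def scan(rows: Iterable[Row], columns: List[str]) -> Dict[str, State]:
    """Compute the states for all requested columns in one pass."""
    states = {c: State() for c in columns}
    for row in rows:
        for c in columns:
            states[c].add(row.get(c))
    return states


def evaluate(check: Check, states: Dict[str, State]) -> Dict[str, bool]:
    return {desc: pred(states[col]) for desc, col, pred in check.constraints}


check = (Check("orders")
         .is_complete("id")
         .is_unique("id")
         .satisfies("price", "price completeness >= 0.5",
                    lambda s: s.completeness >= 0.5))

batch1 = [{"id": 1, "price": 9.5}, {"id": 2, "price": None}]
batch2 = [{"id": 3, "price": 4.0}, {"id": 3, "price": None}]

columns = sorted({col for _, col, _ in check.constraints})
s1 = scan(batch1, columns)
print(evaluate(check, s1))

# Incremental validation: scan only the new batch, then merge the states.
s2 = scan(batch2, columns)
merged = {c: s1[c].merge(s2[c]) for c in columns}
print(evaluate(check, merged))
```

In this sketch the duplicate `id` introduced by the second batch is detected from the merged states alone, without reading the first batch again.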
