AlphaClean: Automatic Generation of Data Cleaning Pipelines

The analyst effort in data cleaning is gradually shifting away from the design of hand-written scripts to building and tuning complex pipelines of automated data cleaning libraries. Hyper-parameter tuning for data cleaning is very different than hyper-parameter tuning for machine learning since the pipeline components and objective functions have structure that tuning algorithms can exploit. This paper proposes a framework, called AlphaClean, that rethinks parameter tuning for data cleaning pipelines. AlphaClean provides users with a rich library to define data quality measures with weighted sums of SQL aggregate queries. AlphaClean applies generate-then-search framework where each pipelined cleaning operator contributes candidate transformations to a shared pool. Asynchronously, in separate threads, a search algorithm sequences them into cleaning pipelines that maximize the user-defined quality measures. This architecture allows AlphaClean to apply a number of optimizations including incremental evaluation of the quality measures and learning dynamic pruning rules to reduce the search space. Our experiments on real and synthetic benchmarks suggest that AlphaClean finds solutions of up-to 9x higher quality than naively applying state-of-the-art parameter tuning methods, is significantly more robust to straggling data cleaning methods and redundancy in the data cleaning library, and can incorporate state-of-the-art cleaning systems such as HoloClean as cleaning operators.

[1]  Peter Norvig,et al.  Artificial Intelligence: A Modern Approach , 1995 .

[2]  Ameet Talwalkar,et al.  Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization , 2016, J. Mach. Learn. Res..

[3]  Purnamrita Sarkar,et al.  Scaling Up Crowd-Sourcing to Very Large Datasets: A Case for Active Learning , 2014, Proc. VLDB Endow..

[4]  H. V. Jagadish,et al.  Foofah: Transforming Data By Example , 2017, SIGMOD Conference.

[5]  Jeffrey F. Naughton,et al.  Corleone: hands-off crowdsourcing for entity matching , 2014, SIGMOD Conference.

[6]  Sanjay Krishnan,et al.  Wisteria: Nurturing Scalable Data Cleaning Infrastructure , 2015, Proc. VLDB Endow..

[7]  Alan M. Frieze,et al.  Min-Wise Independent Permutations , 2000, J. Comput. Syst. Sci..

[8]  Ihab F. Ilyas,et al.  Data Cleaning: Overview and Emerging Challenges , 2016, SIGMOD Conference.

[9]  Erhard Rahm,et al.  Data Cleaning: Problems and Current Approaches , 2000, IEEE Data Eng. Bull..

[10]  Fotis Psallidas,et al.  Combining Design and Performance in a Data Visualization Management System , 2017, CIDR.

[11]  Dennis Shasha,et al.  Declarative Data Cleaning: Language, Model, and Algorithms , 2001, VLDB.

[12]  Tim Kraska,et al.  SampleClean: Fast and Reliable Analytics on Dirty Data , 2015, IEEE Data Eng. Bull..

[13]  Alin Deutsch,et al.  The chase revisited , 2008, PODS.

[14]  Benjamin Recht,et al.  KeystoneML: Optimizing Pipelines for Large-Scale Advanced Analytics , 2016, 2017 IEEE 33rd International Conference on Data Engineering (ICDE).

[15]  Jianzhong Li,et al.  Incremental Detection of Inconsistencies in Distributed Data , 2014, IEEE Trans. Knowl. Data Eng..

[16]  Alfred V. Aho,et al.  The theory of joins in relational data bases , 1977, 18th Annual Symposium on Foundations of Computer Science (sfcs 1977).

[17]  David D. Cox,et al.  Hyperopt: A Python Library for Optimizing the Hyperparameters of Machine Learning Algorithms , 2013, SciPy.

[18]  Theodoros Rekatsinas,et al.  Deep Learning for Entity Matching: A Design Space Exploration , 2018, SIGMOD Conference.

[19]  Paolo Papotti,et al.  Interactive and Deterministic Data Cleaning , 2016, SIGMOD Conference.

[20]  Samuel Madden,et al.  MacroBase: Prioritizing Attention in Fast Data , 2016, SIGMOD Conference.

[21]  Sanjay Krishnan,et al.  ActiveClean: Interactive Data Cleaning For Statistical Modeling , 2016, Proc. VLDB Endow..

[22]  Ahmed K. Elmagarmid,et al.  Don't be SCAREd: use SCalable Automatic REpairing with maximal likelihood and bounded changes , 2013, SIGMOD '13.

[23]  E. F. CODD,et al.  A relational model of data for large shared data banks , 1970, CACM.

[24]  H. V. Jagadish,et al.  Data Integration using Self-Maintainable Views , 1996, EDBT.

[25]  Yue Zhao,et al.  PyOD: A Python Toolbox for Scalable Outlier Detection , 2019, J. Mach. Learn. Res..

[26]  Tova Milo,et al.  Query-Oriented Data Cleaning with Oracles , 2015, SIGMOD Conference.

[27]  Paolo Papotti,et al.  BigDansing: A System for Big Data Cleansing , 2015, SIGMOD Conference.

[28]  Sam Madden,et al.  Outlier Detection in Heterogeneous Datasets using Automatic Tuple Expansion , 2016 .

[29]  Jeffrey Heer,et al.  Wrangler: interactive visual specification of data transformation scripts , 2011, CHI.

[30]  Dmitri V. Kalashnikov,et al.  Progressive Approach to Relational Entity Resolution , 2014, Proc. VLDB Endow..

[31]  Ihab F. Ilyas,et al.  Trends in Cleaning Relational Data: Consistency and Deduplication , 2015, Found. Trends Databases.

[32]  Christopher Ré,et al.  The HoloClean Framework Dataset to be cleaned Denial Constraints External Information t 1 t 4 t 2 t 3 Johnnyo ’ s , 2017 .

[33]  Tim Kraska,et al.  A Data Quality Metric (DQM): How to Estimate the Number of Undetected Errors in Data Sets , 2016, Proc. VLDB Endow..

[34]  Nando de Freitas,et al.  Taking the Human Out of the Loop: A Review of Bayesian Optimization , 2016, Proceedings of the IEEE.

[35]  Leopoldo E. Bertossi,et al.  Database Repairing and Consistent Query Answering , 2011, Database Repairing and Consistent Query Answering.

[36]  Ahmed K. Elmagarmid,et al.  Guided data repair , 2011, Proc. VLDB Endow..

[37]  Tim Kraska,et al.  A sample-and-clean framework for fast and accurate query processing on dirty data , 2014, SIGMOD Conference.

[38]  Xin Zhang,et al.  TFX: A TensorFlow-Based Production-Scale Machine Learning Platform , 2017, KDD.

[39]  Joseph M. Hellerstein,et al.  Potter's Wheel: An Interactive Data Cleaning System , 2001, VLDB.

[40]  AnHai Doan,et al.  Toward a System Building Agenda for Data Integration (and Data Science) , 2018, IEEE Data Eng. Bull..

[41]  Sanjay Krishnan,et al.  Towards reliable interactive data cleaning: a user survey and recommendations , 2016, HILDA '16.

[42]  Samuel Madden,et al.  Scorpion: Explaining Away Outliers in Aggregate Queries , 2013, Proc. VLDB Endow..

[43]  Ion Stoica,et al.  Tune: A Research Platform for Distributed Model Selection and Training , 2018, ArXiv.

[44]  Hotham Altwaijry,et al.  QuERy: A Framework for Integrating Entity Resolution with Query Processing , 2015, Proc. VLDB Endow..

[45]  D. Sculley,et al.  Google Vizier: A Service for Black-Box Optimization , 2017, KDD.