Paper Information - Wisteria: Nurturing Scalable Data Cleaning Infrastructure

Wisteria: Nurturing Scalable Data Cleaning Infrastructure

Analysts report spending upwards of 80% of their time on problems in data cleaning. The data cleaning process is inherently iterative: cleaning workflows evolve, starting with basic exploratory data analysis on small samples of dirty data, then refining the analysis with more sophisticated and expensive cleaning operators (e.g., crowdsourcing), and finally applying the insights to the full dataset. While an analyst often knows at a logical level which operations need to be done, they often have to manage a large search space of physical operators and parameters. We present Wisteria, a system designed to support the iterative development and optimization of data cleaning workflows, especially ones that utilize the crowd. Wisteria separates logical operations from physical implementations and, driven by analyst feedback, suggests optimizations of, or replacements for, the analyst's choice of physical implementation. We highlight research challenges in sampling, in-flight operator replacement, and crowdsourcing. We give an overview of the system architecture and these techniques, then provide a demonstration designed to showcase how Wisteria can improve iterative data analysis and cleaning. The code is available at: http://www.sampleclean.org.
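
The separation of logical operations from physical implementations described above can be illustrated with a minimal sketch. This is not Wisteria's API: the class names (`LogicalOperator`, `PhysicalOperator`), the two deduplication functions, and the sample records are all hypothetical, chosen only to show how a workflow step named by intent might have its implementation swapped after the analyst inspects results on a small sample.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Record = Dict[str, str]


@dataclass
class PhysicalOperator:
    """One concrete way to carry out a logical step (hypothetical type)."""
    name: str
    run: Callable[[List[Record]], List[Record]]


class LogicalOperator:
    """A cleaning step named by intent, with a replaceable implementation."""

    def __init__(self, name: str, implementation: PhysicalOperator) -> None:
        self.name = name
        self.implementation = implementation

    def replace(self, implementation: PhysicalOperator) -> None:
        # Swap the physical implementation; the logical workflow is unchanged.
        self.implementation = implementation

    def apply(self, records: List[Record]) -> List[Record]:
        return self.implementation.run(records)


def dedup_exact(records: List[Record]) -> List[Record]:
    """Drop records whose 'name' field is byte-for-byte identical."""
    seen, kept = set(), []
    for record in records:
        if record["name"] not in seen:
            seen.add(record["name"])
            kept.append(record)
    return kept


def dedup_normalized(records: List[Record]) -> List[Record]:
    """Drop records whose 'name' matches after lowercasing and collapsing spaces."""
    seen, kept = set(), []
    for record in records:
        key = " ".join(record["name"].lower().split())
        if key not in seen:
            seen.add(key)
            kept.append(record)
    return kept


# Illustrative dirty data (made up for this sketch).
dirty = [
    {"name": "Acme Corp"},
    {"name": "acme  corp"},
    {"name": "Acme Corp"},
    {"name": "Globex"},
]

# Step 1: explore on a small sample with a cheap implementation.
dedup = LogicalOperator("deduplicate", PhysicalOperator("exact", dedup_exact))
sample_result = dedup.apply(dirty[:2])

# Step 2: the sample still contains a near-duplicate, so swap the
# implementation and apply the same logical step to the full dataset.
dedup.replace(PhysicalOperator("normalized", dedup_normalized))
full_result = dedup.apply(dirty)
```

In this sketch the analyst's workflow only ever refers to the logical step `deduplicate`; which physical operator runs behind it is a separate decision that can change between the sampling phase and the full run.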
