Interactive Cleaning for Progressive Visualization through Composite Questions

In this paper, we study the problem of interactive cleaning for progressive visualization (ICPV): given a bad visualization V, the goal is to obtain a "cleaned" visualization V' that is far from V under a distance measure, within a given (small) budget of human cost. In ICPV, a system interacts with a user iteratively. In each iteration, it asks the user a data cleaning question such as "how to clean detected errors x?" and applies the user's value updates to clean V. Conventional wisdom picks, in each iteration, the single question (e.g., "Are SIGMOD conference and SIGMOD the same?") with the maximum expected benefit. We propose instead to ask a composite question in each iteration, i.e., a group of single questions treated as one question (for example, "Are SIGMOD conference in t1 and SIGMOD in t2 the same value, and are t1 and t2 duplicates?"). A composite question is presented to the user as a small connected graph through a novel GUI that the user can operate on directly. We propose algorithms to select the best composite question in each iteration. Experiments on real-world datasets verify that, for the same human cost, composite questions are more effective than single questions asked in isolation.
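The iterative, budget-constrained interaction described above can be illustrated with a minimal sketch. It is not the paper's selection algorithm: the `Question` class, the benefit and cost numbers, and the rule "ask the affordable question with the largest estimated benefit" are all assumptions made for illustration, and a composite question is modeled simply as one more candidate with its own estimated benefit and cost.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Question:
    """Hypothetical candidate question (single or composite)."""
    label: str
    benefit: float  # assumed estimate of how much the answer improves the visualization
    cost: float     # assumed human cost of answering


def pick_question(pool: Sequence[Question], remaining: float) -> Optional[Question]:
    """Return the affordable question with the largest estimated benefit, if any."""
    affordable = [q for q in pool if q.cost <= remaining]
    if not affordable:
        return None
    return max(affordable, key=lambda q: q.benefit)


def run_session(candidates: Sequence[Question], budget: float) -> List[str]:
    """Ask one question per iteration until no affordable question remains."""
    pool = list(candidates)
    remaining = budget
    asked: List[str] = []
    while True:
        q = pick_question(pool, remaining)
        if q is None:
            break
        asked.append(q.label)   # a real system would show q to the user here
        remaining -= q.cost     # and apply the user's value updates
        pool.remove(q)
    return asked


# Illustrative numbers only (not taken from the paper).
candidates = [
    Question("single: same value?", benefit=1.0, cost=1.0),
    Question("single: duplicates?", benefit=0.8, cost=1.0),
    Question("composite: same value and duplicates?", benefit=1.6, cost=1.5),
]
asked = run_session(candidates, budget=2.0)
```

With these made-up numbers the composite question is asked first and uses most of the budget, so neither single question is asked afterwards. A real system would also re-estimate benefits after each answer, which this sketch omits.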
