Manually Detecting Errors for Data Cleaning Using Adaptive Crowdsourcing Strategies

Current work to detect data errors often uses (semi-)automatic solutions. In this paper, however, we argue that there are many real-world scenarios where users have to detect data errors completely manually, and that more attention should be devoted to this problem. We then study one instance of this problem in depth. Specifically, we focus on the problem of manually verifying the values of a target attribute, and shows that the current best solution in industry, which uses crowdsourcing, has significant limitations. We develop a new solution that addresses the above limitations. Our solution can find a much more accurate ranking of the data values in terms of their difficulties for crowdsourcing, can help domain experts debug this ranking, and can handle ambiguous values for which no golden answers exist. Importantly, our solution provides a unified framework that allows users to easily express and solve a broad range of optimization problems for crowdsourcing, to balance between cost and accuracy. Finally, we describe extensive experiments with three real-world data sets that demonstrate the utility and promise of our solution approach.

[1]  Rob Miller,et al.  Crowdsourced Databases: Query Processing with People , 2011, CIDR.

[2]  Yannis Papakonstantinou,et al.  Waldo: An Adaptive Human Interface for Crowd Entity Resolution , 2017, SIGMOD Conference.

[3]  Hector Garcia-Molina,et al.  CrowdDQS: Dynamic Question Selection in Crowdsourcing Systems , 2017, SIGMOD Conference.

[4]  Michael Stonebraker,et al.  Detecting Data Errors: Where are we and what needs to be done? , 2016, Proc. VLDB Endow..

[5]  Panagiotis G. Ipeirotis,et al.  Quality management on Amazon Mechanical Turk , 2010, HCOMP '10.

[6]  J. Leeuw,et al.  Isotone Optimization in R: Pool-Adjacent-Violators Algorithm (PAVA) and Active Set Methods , 2009 .

[7]  Ashwin Machanavajjhala,et al.  An automatic blocking mechanism for large-scale de-duplication tasks , 2012, CIKM '12.

[8]  Shipeng Yu,et al.  Eliminating Spammers and Ranking Annotators for Crowdsourced Labeling Tasks , 2012, J. Mach. Learn. Res..

[9]  Sanjay Krishnan,et al.  ActiveClean: Interactive Data Cleaning For Statistical Modeling , 2016, Proc. VLDB Endow..

[10]  Lei Chen,et al.  CrowdCleaner: Data cleaning for multi-version data on the web via crowdsourcing , 2014, 2014 IEEE 30th International Conference on Data Engineering.

[11]  Bo Zhao,et al.  The wisdom of minority: discovering and targeting the right group of workers for crowdsourcing , 2014, WWW.

[12]  Tim Kraska,et al.  CrowdER: Crowdsourcing Entity Resolution , 2012, Proc. VLDB Endow..

[13]  Surajit Chaudhuri,et al.  Data Debugger: An Operator-Centric Approach for Data Quality Solutions , 2006, IEEE Data Eng. Bull..

[14]  George Papastefanatos,et al.  Parallel meta-blocking: Realizing scalable entity resolution over large, heterogeneous data , 2015, 2015 IEEE International Conference on Big Data (Big Data).

[15]  Ihab F. Ilyas,et al.  Data Cleaning: Overview and Emerging Challenges , 2016, SIGMOD Conference.

[16]  Jennifer Widom,et al.  Query Optimization over Crowdsourced Data , 2013, Proc. VLDB Endow..

[17]  Erhard Rahm,et al.  Data Cleaning: Problems and Current Approaches , 2000, IEEE Data Eng. Bull..

[18]  Paolo Papotti,et al.  BigDansing: A System for Big Data Cleansing , 2015, SIGMOD Conference.

[19]  Eugene Wu,et al.  CLAMShell: Speeding up Crowds for Low-latency Data Labeling , 2015, Proc. VLDB Endow..

[20]  Beng Chin Ooi,et al.  CDAS: A Crowdsourcing Data Analytics System , 2012, Proc. VLDB Endow..

[21]  Reynold Cheng,et al.  Optimizing Task Assignment for Crowdsourcing Environments , 2013 .

[22]  Theodore Johnson,et al.  Exploratory Data Mining and Data Cleaning , 2003 .

[23]  Andreas Thor,et al.  Parallel Sorted Neighborhood Blocking with MapReduce , 2011, BTW.

[24]  Paolo Papotti,et al.  KATARA: A Data Cleaning System Powered by Knowledge Bases and Crowdsourcing , 2015, SIGMOD Conference.

[25]  Sanjay Krishnan,et al.  Wisteria: Nurturing Scalable Data Cleaning Infrastructure , 2015, Proc. VLDB Endow..

[26]  Gerardo Hermosillo,et al.  Supervised learning from multiple experts: whom to trust when everyone lies a bit , 2009, ICML '09.

[27]  Javier R. Movellan,et al.  Whose Vote Should Count More: Optimal Integration of Labels from Labelers of Unknown Expertise , 2009, NIPS.

[28]  Thorsten Joachims,et al.  Optimizing search engines using clickthrough data , 2002, KDD.

[29]  Jeffrey Heer,et al.  Predictive Interaction for Data Transformation , 2015, CIDR.

[30]  Kai Zhao,et al.  Exploring What not to Clean in Urban Data: A Study Using New York City Taxi Trips , 2016, IEEE Data Eng. Bull..

[31]  Jeffrey F. Naughton,et al.  Corleone: hands-off crowdsourcing for entity matching , 2014, SIGMOD Conference.

[32]  Tim Kraska,et al.  Leveraging transitive relations for crowdsourced joins , 2013, SIGMOD '13.

[33]  Pierre Senellart,et al.  CrowdMiner: Mining association rules from the crowd , 2013, Proc. VLDB Endow..

[34]  AnHai Doan,et al.  Falcon: Scaling Up Hands-Off Crowdsourced Entity Matching to Build Cloud Services , 2017, SIGMOD Conference.

[35]  Purnamrita Sarkar,et al.  Scaling Up Crowd-Sourcing to Very Large Datasets: A Case for Active Learning , 2014, Proc. VLDB Endow..

[36]  Ihab F. Ilyas,et al.  Distributed Data Deduplication , 2016, Proc. VLDB Endow..

[37]  Lakshminarayanan Subramanian,et al.  Reputation-based Worker Filtering in Crowdsourcing , 2014, NIPS.

[38]  Hisashi Kashima,et al.  Accurate Integration of Crowdsourced Labels Using Workers' Self-reported Confidence Scores , 2013, IJCAI.

[39]  Qili Deng,et al.  Deep learning for gender recognition , 2015, 2015 International Conference on Computers, Communications, and Systems (ICCCS).

[40]  Surajit Chaudhuri,et al.  Towards a Domain Independent Platform for Data Cleaning , 2011, IEEE Data Eng. Bull..

[41]  Tim Kraska,et al.  CrowdDB: answering queries with crowdsourcing , 2011, SIGMOD '11.

[42]  Divesh Srivastava,et al.  Global detection of complex copying relationships between sources , 2010, Proc. VLDB Endow..

[43]  C. Spearman The proof and measurement of association between two things. , 2015, International journal of epidemiology.

[44]  Dennis Shasha,et al.  Declarative Data Cleaning: Language, Model, and Algorithms , 2001, VLDB.

[45]  Michael S. Bernstein,et al.  Measuring Crowdsourcing Effort with Error-Time Curves , 2015, CHI.

[46]  Jennifer Widom,et al.  CrowdScreen: algorithms for filtering data with humans , 2012, SIGMOD Conference.

[47]  Aditya G. Parameswaran,et al.  Answering Queries using Humans, Algorithms and Databases , 2011, CIDR.

[48]  Ihab F. Ilyas,et al.  Qualitative Data Cleaning , 2016, Proc. VLDB Endow..