Quality Management of Workers in an In-House Crowdsourcing-Based Framework for Deduplication of Organizations’ Databases

Organizations in the current era of big data generate massive volumes of data, and they must maintain its quality for it to be useful in decision-making. Dirty data plagues every organization. One aspect of dirty data is the presence of duplicate records, which negatively impact an organization’s operations in many ways. Many existing approaches address this problem with traditional data cleansing methods. In this paper, we address it with an in-house crowdsourcing-based framework named DedupCrowd. One of the main challenges of crowdsourcing-based approaches is monitoring the performance of the crowd, on which the integrity of the whole process depends. We propose a statistical quality control-based technique to regulate the performance of the crowd. We apply the proposed framework in the context of a contact center, where Customer Service Representatives act as the crowd and assist in duplicate detection. Using comprehensive working examples, we show how the different modules of DedupCrowd work not only to monitor the performance of the crowd but also to assist in duplicate detection.
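As a rough illustration of what statistical quality control over crowd workers can look like, the sketch below applies a standard p-chart (proportion-nonconforming control chart) to a worker's per-batch error counts. This is not taken from the paper: the abstract does not say which control chart or rule DedupCrowd uses, so the function names, the fixed batch size, the baseline error rate `p_bar`, and the 3-sigma limit are all assumptions chosen for the example.

```python
from math import sqrt


def p_chart_limits(p_bar: float, n: int, k: float = 3.0) -> tuple[float, float]:
    """Return (LCL, UCL) of a p-chart for batches of n tasks.

    p_bar is an assumed baseline error proportion, e.g. estimated from
    tasks with known answers; k is the control-limit width in sigmas.
    """
    sigma = sqrt(p_bar * (1.0 - p_bar) / n)
    lcl = max(0.0, p_bar - k * sigma)
    ucl = min(1.0, p_bar + k * sigma)
    return lcl, ucl


def flag_batches(errors_per_batch: list[int], n: int, p_bar: float,
                 k: float = 3.0) -> list[int]:
    """Return indices of batches whose error proportion exceeds the UCL.

    Only the upper limit is checked here, because an unusually high
    error rate is the signal of interest when monitoring a worker.
    """
    _, ucl = p_chart_limits(p_bar, n, k)
    return [i for i, errors in enumerate(errors_per_batch) if errors / n > ucl]


# Hypothetical data: one worker, four batches of 100 record pairs each,
# with an assumed baseline error rate of 10%.
lcl, ucl = p_chart_limits(p_bar=0.10, n=100)
flagged = flag_batches([8, 12, 25, 9], n=100, p_bar=0.10)
```

With these hypothetical numbers, only the third batch (25 errors in 100 tasks) falls above the upper control limit, so that batch would prompt a review of the worker's answers. A real deployment would need to choose the chart type, batch size, and baseline from its own data.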
