Paper information - CrowdCleaner: Data cleaning for multi-version data on the web via crowdsourcing

Multi-version data is often among the information of greatest interest on the Web, since this type of data is usually updated frequently. Although some Web information integration systems try to maintain the latest version, the multi-version data they maintain usually includes inaccurate and invalid information caused by data integration errors or update delays. In this demo, we present CrowdCleaner, a smart data cleaning system for multi-version data on the Web, which uses crowdsourcing-based approaches to detect and repair errors that traditional data integration and cleaning techniques usually cannot resolve. In particular, CrowdCleaner blends active and passive crowdsourcing methods to rectify errors in multi-version data. We demonstrate the following four facilities provided by CrowdCleaner: (1) an error-monitor that finds which items (e.g., submission date, price of real estate) are wrong versions according to reports from the crowd, which is a passive crowdsourcing strategy; (2) a task-manager that allocates tasks to human workers intelligently; (3) a smart-decision-maker that identifies which answer from the crowd is correct using active crowdsourcing methods; and (4) a whom-to-ask-finder that discovers which users (or human workers) are the most credible according to their answer records.

Lei Chen | Chen Jason Zhang | Yongxin Tong | Caleb Chen Cao | Yatao Li
