Cleaning uncertain graphs via noisy crowdsourcing

Uncertain graphs are an important data model for many real-world applications. In an uncertain graph, each edge is associated with an existential probability that represents the likelihood that the edge exists, and queries are answered with respect to these probabilities. Almost all work in this area focuses on improving the efficiency of query processing. However, another issue deserves attention: query results on uncertain graphs are sometimes uninformative because of the edge uncertainty. We adopt a crowdsourcing-based approach to make query results more informative. To save the monetary and time cost of crowdsourcing, we should select the optimal edges to clean so as to maximize the quality improvement. However, noise in the crowdsourcing results makes the problem more complex. We prove that the problem is #P-hard and propose an efficient algorithm to derive the optimal edge. Our experimental results show that the proposed algorithm outperforms random selection by up to 22 times in quality improvement and is up to 5 times faster in elapsed time than the each-edge-comparison approach, which indicates that the algorithm is both effective and efficient.
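
To make the setting concrete, the following is a minimal sketch of edge cleaning with a noisy crowd. It is not the algorithm proposed here. It rests on several assumptions the abstract does not state: the query is s-t reachability, result quality is measured by the binary entropy of the answer, and every crowd answer is correct with a single fixed probability `accuracy`. The toy graph `EDGES` and all function names are hypothetical. The sketch enumerates every possible world, so it is exponential in the number of edges and only suitable for tiny examples.

```python
from itertools import product
from math import log2

# Hypothetical toy uncertain graph: directed edge -> existential probability.
EDGES = {("s", "a"): 0.5, ("a", "t"): 0.8, ("s", "t"): 0.3}


def reachable(present, src, dst):
    """Depth-first search over the edges that exist in one possible world."""
    stack, seen = [src], {src}
    while stack:
        u = stack.pop()
        if u == dst:
            return True
        for x, y in present:
            if x == u and y not in seen:
                seen.add(y)
                stack.append(y)
    return False


def reach_prob(edges, src, dst):
    """Exact reachability probability by enumerating all 2^|E| possible worlds."""
    keys = list(edges)
    total = 0.0
    for mask in product([False, True], repeat=len(keys)):
        world_prob = 1.0
        for key, exists in zip(keys, mask):
            world_prob *= edges[key] if exists else 1.0 - edges[key]
        present = [key for key, exists in zip(keys, mask) if exists]
        if reachable(present, src, dst):
            total += world_prob
    return total


def entropy(p):
    """Binary entropy in bits (assumed quality measure: lower is more informative)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * log2(p) + (1.0 - p) * log2(1.0 - p))


def posterior(prior, answer_yes, accuracy):
    """Bayes update of an edge probability after one noisy crowd answer."""
    like_exists = accuracy if answer_yes else 1.0 - accuracy
    like_absent = 1.0 - accuracy if answer_yes else accuracy
    numerator = like_exists * prior
    return numerator / (numerator + like_absent * (1.0 - prior))


def expected_entropy_after_cleaning(edges, edge, src, dst, accuracy):
    """Expected answer entropy if the crowd is asked once about `edge`."""
    prior = edges[edge]
    p_yes = accuracy * prior + (1.0 - accuracy) * (1.0 - prior)
    expected = 0.0
    for answer_yes, p_answer in ((True, p_yes), (False, 1.0 - p_yes)):
        if p_answer == 0.0:
            continue
        updated = dict(edges)
        updated[edge] = posterior(prior, answer_yes, accuracy)
        expected += p_answer * entropy(reach_prob(updated, src, dst))
    return expected


def best_edge_to_clean(edges, src, dst, accuracy):
    """Brute force: compare every edge and keep the one with the lowest expected entropy."""
    return min(
        edges,
        key=lambda e: expected_entropy_after_cleaning(edges, e, src, dst, accuracy),
    )
```

Calling `best_edge_to_clean(EDGES, "s", "t", 0.8)` compares all three edges one by one. This exhaustive comparison is the kind of cost that a dedicated selection algorithm aims to avoid, and the exact enumeration in `reach_prob` reflects why the underlying problem is hard in general.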
