A Confidence-Aware Approach for Truth Discovery on Long-Tail Data

In many real-world applications, the same item may be described by multiple sources. Conflicts among these sources are therefore inevitable, which leads to an important task: identifying which piece of information is trustworthy, i.e., the truth discovery task. Intuitively, a piece of information from a reliable source is more trustworthy, and a source that provides trustworthy information is more reliable. Based on this principle, truth discovery approaches have been proposed to infer source reliability degrees and the most trustworthy information (i.e., the truth) simultaneously. However, existing approaches overlook the ubiquitous long-tail phenomenon in these tasks: most sources provide only a few claims, and only a few sources make plenty of claims. As a result, reliability estimates for small sources rest on too little evidence to be trusted. To tackle this challenge, we propose a confidence-aware truth discovery (CATD) method that automatically detects truths from conflicting data exhibiting the long-tail phenomenon. The proposed method not only estimates source reliability but also considers the confidence interval of the estimate, so that it can effectively reflect real source reliability for sources with various levels of participation. Experiments on four real-world tasks as well as simulated multi-source long-tail datasets demonstrate that the proposed method outperforms existing state-of-the-art truth discovery approaches by successfully discounting the effect of small sources.
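
The idea of weighting a source by a confidence interval on its reliability, rather than by a point estimate, can be illustrated with a minimal sketch. The abstract does not state the estimator, so everything below is an assumption made for illustration: source errors are taken to be Gaussian, each source is weighted by the reciprocal of the upper bound of a chi-squared confidence interval on its error variance, and truths are weighted averages of claims. The function names (`chi2_quantile`, `confidence_weight`, `discover_truths`) and the toy data are hypothetical and are not taken from the paper.

```python
import math

def chi2_cdf(x, k):
    """Chi-squared CDF with k degrees of freedom, computed from the series
    for the regularized lower incomplete gamma function P(k/2, x/2)."""
    if x <= 0:
        return 0.0
    a, z = k / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-15 * total:
        term *= z / (a + n)
        total += term
        n += 1
    return total * math.exp(-z + a * math.log(z) - math.lgamma(a))

def chi2_quantile(p, k):
    """Invert chi2_cdf by bisection (adequate for a small illustration)."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, k) < p:
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, k) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def confidence_weight(sse, n_claims, alpha=0.05):
    """Assumed weight: reciprocal of the upper (1 - alpha) confidence bound
    on the error variance, i.e. chi2_{alpha/2, n} / SSE. With few claims the
    quantile is tiny, so small sources are discounted."""
    return chi2_quantile(alpha / 2.0, n_claims) / (sse + 1e-9)

def discover_truths(claims, alpha=0.05, iters=20):
    """claims maps source -> {item: numeric value}. Alternates between
    re-weighting sources and re-estimating truths as weighted averages."""
    items = sorted({item for c in claims.values() for item in c})
    truths = {}
    for item in items:
        vals = [c[item] for c in claims.values() if item in c]
        truths[item] = sum(vals) / len(vals)  # start from the plain mean
    weights = {}
    for _ in range(iters):
        for src, c in claims.items():
            sse = sum((v - truths[item]) ** 2 for item, v in c.items())
            weights[src] = confidence_weight(sse, len(c), alpha)
        for item in items:
            num = sum(weights[s] * c[item] for s, c in claims.items() if item in c)
            den = sum(weights[s] for s, c in claims.items() if item in c)
            truths[item] = num / den
    return truths, weights

# Toy long-tail data: two large sources and one single-claim source.
claims = {
    "A": {"i1": 10.1, "i2": 19.9, "i3": 30.1, "i4": 39.9, "i5": 50.1, "i6": 59.9},
    "B": {"i1": 9.8, "i2": 20.2, "i3": 29.8, "i4": 40.2, "i5": 49.8, "i6": 60.2},
    "C": {"i1": 15.0},
}
truths, weights = discover_truths(claims)
```

Under these assumptions, two sources with the same mean squared error receive different weights depending on how many claims they make, because the chi-squared quantile grows faster than the claim count at the lower tail. A weighting based only on a point estimate of the variance would treat them identically.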
