Truth Discovery via Exploiting Implications from Multi-Source Data

Data veracity is a grand challenge for many tasks on the Web. Because web data sources are inherently unreliable and may provide conflicting information about the same real-world entities, truth discovery is emerging as a means of resolving such conflicts by identifying the truth, that is, the values that conform to reality, from multi-source data. A major challenge in truth discovery is that different data items may have varying numbers of true values (the multi-truth problem), which violates the assumption of existing truth discovery methods that each data item has exactly one true value. In this paper, we address this challenge by exploiting implications from multi-source data. In particular, we exploit three types of implications to facilitate multi-truth discovery: the implicit negative claims, the distribution of positive/negative claims, and the co-occurrence of values in sources' claims. We propose a probabilistic approach with improvement measures that incorporate the three implications in all stages of the truth discovery process. Specifically, incorporating the negative claims enables multi-truth discovery; considering the distribution of positive/negative claims relieves truth discovery of the impact of sources' behavioral features in specific datasets; and considering the co-occurrence of values compensates for the information lost when the values in the same claim are evaluated individually. Experimental results on three real-world datasets demonstrate the effectiveness of our approach.
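To make the first and third implications concrete, the following is a minimal sketch, not the paper's actual model. It assumes that a source covering a data item implicitly denies every candidate value it does not list, and that co-occurrence can be summarized by counting value pairs that appear together in one claim. The toy data and all function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: source -> {data item -> set of claimed values}.
claims = {
    "s1": {"book_1": {"Alice", "Bob"}},
    "s2": {"book_1": {"Alice"}},
    "s3": {"book_1": {"Alice", "Bob", "Carol"}},
}

def candidate_values(claims, item):
    """Union of all values that any source claims for the item."""
    values = set()
    for per_item in claims.values():
        values |= per_item.get(item, set())
    return values

def positive_negative_claims(claims, item):
    """Map each covering source to (positive values, implicit negative values).

    Assumption: a source that covers the item but omits a candidate
    value is treated as implicitly denying that value.
    """
    candidates = candidate_values(claims, item)
    result = {}
    for source, per_item in claims.items():
        if item not in per_item:
            continue  # a source that is silent on the item makes no claim
        positive = set(per_item[item])
        result[source] = (positive, candidates - positive)
    return result

def cooccurrence_counts(claims, item):
    """Count how often each pair of values appears in the same claim."""
    counts = Counter()
    for per_item in claims.values():
        for pair in combinations(sorted(per_item.get(item, set())), 2):
            counts[pair] += 1
    return counts

split = positive_negative_claims(claims, "book_1")
pairs = cooccurrence_counts(claims, "book_1")
```

In this toy example, source `s2` makes one positive claim ("Alice") and two implicit negative claims ("Bob" and "Carol"), while the pair ("Alice", "Bob") co-occurs in two claims. A probabilistic method would then weigh such counts by estimated source quality, which this sketch does not attempt.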
