Truth Discovery Algorithms: An Experimental Evaluation

QCRI Technical Report, May 2014

A fundamental problem in data fusion is to determine the veracity of multi-source data in order to resolve conflicts. While previous work in truth discovery has proved useful in practice for specific settings, source behaviors, or data set characteristics, there has been limited systematic comparison of the competing methods in terms of efficiency, usability, and repeatability. We remedy this deficit by providing a comprehensive review of 12 state-of-the-art algorithms for truth discovery. We provide reference implementations and an in-depth evaluation of the methods based on extensive experiments on synthetic and real-world data. We analyze aspects of the problem that have not been explicitly studied before, such as the impact of initialization and parameter setting, convergence, and scalability. We provide an experimental framework for extensively comparing the methods across a wide range of truth discovery scenarios, in which source coverage, the number and distribution of conflicts, and the number of true positive claims can be controlled and used to evaluate the quality and performance of the algorithms. Finally, we report comprehensive findings obtained from the experiments and provide new insights for future research.
