Resolving conflicts in heterogeneous data by truth discovery and source reliability estimation

In many applications, descriptions of the same objects or events can be obtained from a variety of sources, which inevitably leads to conflicting data or information. One important problem is to identify the true information (i.e., the truths) among conflicting sources of data. It is intuitive to trust reliable sources more when deriving the truths, but it is usually unknown a priori which sources are more reliable. Moreover, each source possesses a variety of properties with different data types, so an accurate estimate of source reliability requires modeling multiple properties in a unified model. Existing conflict resolution work either does not estimate source reliability or models multiple properties separately. In this paper, we propose to resolve conflicts among multiple sources of heterogeneous data types. We model the problem using an optimization framework in which truths and source reliability are defined as two sets of unknown variables. The objective is to minimize the overall weighted deviation between the truths and the multi-source observations, where each source is weighted by its reliability. Different loss functions can be incorporated into this framework to capture the characteristics of various data types, and efficient computation approaches are developed. Experiments on real-world weather, stock and flight data, as well as on simulated multi-source data, demonstrate the necessity of jointly modeling different data types in the proposed framework.
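To make the framework concrete, the following is a minimal sketch of one way the weighted-deviation objective could be minimized by alternating between the two sets of unknowns. The abstract does not specify the loss functions, the constraint on the weights, or the update rules, so everything below is an assumption made for illustration: squared error scaled by the per-object variance for continuous properties, 0-1 loss for categorical properties, and a weight update of the form w_k = -log(loss_k / total loss). The function name `resolve_conflicts` and the toy data are hypothetical.

```python
import math
from collections import defaultdict

EPS = 1e-12  # guards against log(0) and division by zero


def resolve_conflicts(continuous, categorical, n_iter=20):
    """Alternate between truth updates and source-weight updates.

    continuous[k][i]  : numeric claim of source k for continuous property i
    categorical[k][i] : label claimed by source k for categorical property i
    Returns (continuous truths, categorical truths, source weights).
    """
    n_sources = len(continuous)
    n_cont = len(continuous[0])
    n_cat = len(categorical[0])

    # Assumption: scale squared errors by the spread of the claims so that
    # continuous and categorical losses are on a comparable scale.
    variances = []
    for i in range(n_cont):
        column = [continuous[k][i] for k in range(n_sources)]
        mean = sum(column) / n_sources
        var = sum((v - mean) ** 2 for v in column) / n_sources
        variances.append(max(var, EPS))

    weights = [1.0] * n_sources
    for _ in range(n_iter):
        w_sum = sum(weights)
        if w_sum <= 0:  # degenerate case: fall back to uniform weights
            weights = [1.0] * n_sources
            w_sum = float(n_sources)

        # Truth step, continuous: the weighted mean minimizes weighted squared loss.
        cont_truth = [
            sum(weights[k] * continuous[k][i] for k in range(n_sources)) / w_sum
            for i in range(n_cont)
        ]

        # Truth step, categorical: the weighted vote minimizes weighted 0-1 loss.
        cat_truth = []
        for i in range(n_cat):
            votes = defaultdict(float)
            for k in range(n_sources):
                votes[categorical[k][i]] += weights[k]
            cat_truth.append(max(votes, key=votes.get))

        # Weight step: sources with a smaller total deviation get a larger weight.
        losses = []
        for k in range(n_sources):
            loss = sum(
                (continuous[k][i] - cont_truth[i]) ** 2 / variances[i]
                for i in range(n_cont)
            )
            loss += sum(
                1.0 for i in range(n_cat) if categorical[k][i] != cat_truth[i]
            )
            losses.append(loss + EPS)
        total = sum(losses)
        weights = [-math.log(loss / total) for loss in losses]

    return cont_truth, cat_truth, weights


# Hypothetical toy data: three sources, the third one is unreliable.
temperatures = [[70.0, 55.0, 80.0], [71.0, 54.0, 81.0], [90.0, 30.0, 60.0]]
conditions = [
    ["sunny", "rain", "sunny"],
    ["sunny", "rain", "sunny"],
    ["rain", "sunny", "sunny"],
]
cont_truth, cat_truth, weights = resolve_conflicts(temperatures, conditions)
print(cont_truth, cat_truth, weights)
```

In this sketch both data types contribute to a single per-source loss, so errors on the categorical property lower the weight applied to that source's continuous claims as well, which is the kind of joint modeling the abstract argues for.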
