Truth finding by reliability estimation on inconsistent entities for heterogeneous data sets

Abstract An important task in big data integration is to derive accurate data records from noisy and conflicting values collected from multiple sources. Most existing truth finding methods assume that a source's reliability is uniform across the whole data set, ignoring the fact that different attributes, objects and object groups may have different reliabilities even with respect to the same source. These reliability differences are caused by differences in the difficulty of obtaining attribute values, non-uniform updates to objects and differences in group privileges. This paper addresses the problem of how to compute truths by effectively estimating the reliabilities of attributes, objects and object groups in a multi-source heterogeneous data environment. We first propose TFAR, an optimization framework for Truth Finding by Attribute Reliability estimation, together with its implementation and a Lagrangian duality solution. We then present TFOR, a Bayesian probabilistic graphical model for Truth Finding by Object Reliability estimation, with an inference algorithm based on Collapsed Gibbs Sampling. Finally, we give TFGR, an optimization framework for Truth Finding by Group Reliability estimation, and its implementation. These models yield more accurate estimates of the respective attribute, object and object group reliabilities, which in turn improve the accuracy of the inferred truths. Experimental results on both real and synthetic data show that our methods outperform state-of-the-art truth discovery methods.
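To make the idea of attribute-level reliability concrete, the following is a minimal sketch of a generic iterative weighted-vote scheme that keeps one weight per (source, attribute) pair instead of one weight per source. It is not the TFAR formulation or its Lagrangian solution, which the abstract does not spell out; the function name `find_truths`, the smoothed error rate, the negative-log weight update and the toy `claims` data are all assumptions chosen for illustration.

```python
import math
from collections import defaultdict


def find_truths(claims, n_iter=10):
    """Illustrative truth finding with per-(source, attribute) weights.

    claims: list of (source, obj, attr, value) tuples with categorical values.
    Returns (truths, weights). Hypothetical helper, not the paper's TFAR.
    """
    # Start by trusting every (source, attribute) pair equally.
    weights = {(s, a): 1.0 for s, _, a, _ in claims}
    truths = {}
    for _ in range(n_iter):
        # Step 1: weighted vote per (object, attribute) cell.
        votes = defaultdict(lambda: defaultdict(float))
        for s, o, a, v in claims:
            votes[(o, a)][v] += weights[(s, a)]
        # Ties are broken by the smallest value to stay deterministic.
        truths = {
            cell: max(sorted(vals), key=vals.get)
            for cell, vals in votes.items()
        }
        # Step 2: re-estimate weights from disagreement with current truths.
        errors = defaultdict(float)
        counts = defaultdict(int)
        for s, o, a, v in claims:
            errors[(s, a)] += 0.0 if truths[(o, a)] == v else 1.0
            counts[(s, a)] += 1
        for key in weights:
            # Smoothed error rate stays strictly inside (0, 1) (assumption).
            rate = (errors[key] + 0.5) / (counts[key] + 1.0)
            weights[key] = -math.log(rate)
    return truths, weights


# Toy data: source A is good on "city" but poor on "zip"; B is the opposite.
claims = [
    ("A", "o1", "city", "X"), ("B", "o1", "city", "X"), ("C", "o1", "city", "X"),
    ("A", "o2", "city", "Y"), ("B", "o2", "city", "Z"), ("C", "o2", "city", "Y"),
    ("A", "o3", "city", "W"), ("B", "o3", "city", "Q"), ("C", "o3", "city", "W"),
    ("A", "o4", "city", "V"), ("B", "o4", "city", "U"),
    ("A", "o1", "zip", "1"), ("B", "o1", "zip", "2"), ("C", "o1", "zip", "2"),
    ("A", "o2", "zip", "3"), ("B", "o2", "zip", "4"), ("C", "o2", "zip", "4"),
    ("A", "o3", "zip", "5"), ("B", "o3", "zip", "5"), ("C", "o3", "zip", "5"),
    ("A", "o4", "zip", "8"), ("B", "o4", "zip", "9"),
]

truths, weights = find_truths(claims)
print(truths[("o4", "city")], truths[("o4", "zip")])
```

In this toy data, object `o4` is claimed only by A and B, who disagree on both attributes. Because the weights are kept per attribute, the sketch can side with A on `city` and with B on `zip`, which a single per-source weight could not do.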