Data veracity estimation with ensembling truth discovery methods

Estimation of data veracity is recognized as one of the grand challenges of big data. Typically, the goal of truth discovery is to determine the veracity of multi-source, conflicting data and return, as outputs, a veracity label and a confidence score for each data value, along with the trustworthiness score of each source claiming it. Although a plethora of methods has been proposed, it is unlikely a technique dominates all others across all data sets. Furthermore, the performance evaluation of the methods entirely depends on the availability of labeled ground truth data (i.e., data whose veracity has been manually checked). In the context of Big Data, acquiring the complete ground truth data is out-of-reach. In this paper, we propose an ensembling method that mitigates the two problems of method selection and ground truth data sparsity. Our approach combines the results of a set of truth discovery methods and preliminary experiments suggest that it improves the quality performance over the single methods when samples of ground truth data are used.

[1]  Wei Zhang,et al.  Knowledge vault: a web-scale approach to probabilistic knowledge fusion , 2014, KDD.

[2]  Divesh Srivastava,et al.  Global detection of complex copying relationships between sources , 2010, Proc. VLDB Endow..

[3]  Dalia Attia Waguih,et al.  Truth Discovery Algorithms: An Experimental Evaluation QCRI Technical Report, May 2014 , 2014 .

[4]  Divesh Srivastava,et al.  Data quality: The other face of Big Data , 2014, 2014 IEEE 30th International Conference on Data Engineering.

[5]  Bo Zhao,et al.  On the Discovery of Evolving Truth , 2015, KDD.

[6]  Pedro M. Domingos,et al.  On the Optimality of the Simple Bayesian Classifier under Zero-One Loss , 1997, Machine Learning.

[7]  Lina Yao,et al.  Approximate Truth Discovery via Problem Scale Reduction , 2015, CIKM.

[8]  Ian Davidson,et al.  Reverse testing: an efficient framework to select amongst classifiers under sample selection bias , 2006, KDD '06.

[9]  Dan Roth,et al.  Latent credibility analysis , 2013, WWW.

[10]  Bianca Zadrozny,et al.  Learning and evaluating classifiers under sample selection bias , 2004, ICML.

[11]  Taylor Cassidy,et al.  The Wisdom of Minority: Unsupervised Slot Filling Validation based on Multi-dimensional Truth-Finding , 2014, COLING.

[12]  Jianzhong Li,et al.  Incremental Truth Discovery for Information from Multiple Data Sources , 2013, WAIM Workshops.

[13]  Divesh Srivastava,et al.  Truth Discovery and Copying Detection in a Dynamic World , 2009, Proc. VLDB Endow..

[14]  Tarek F. Abdelzaher,et al.  On truth discovery in social sensing: A maximum likelihood estimation approach , 2012, International Symposium on Information Processing in Sensor Networks.

[15]  Bo Zhao,et al.  Truth Discovery and Crowdsourcing Aggregation: A Unified Perspective , 2015, Proc. VLDB Endow..

[16]  Divesh Srivastava,et al.  Fusing data with correlations , 2014, SIGMOD Conference.

[17]  Bo Zhao,et al.  A Confidence-Aware Approach for Truth Discovery on Long-Tail Data , 2014, Proc. VLDB Endow..

[18]  Wei Zhang,et al.  From Data Fusion to Knowledge Fusion , 2014, Proc. VLDB Endow..

[19]  Thomas G. Dietterich Multiple Classifier Systems , 2000, Lecture Notes in Computer Science.

[20]  Bo Zhao,et al.  A Bayesian Approach to Discovering Truth from Conflicting Sources for Data Integration , 2012, Proc. VLDB Endow..

[21]  Wilfred Ng,et al.  Truth Discovery in Data Streams: A Single-Pass Probabilistic Approach , 2014, CIKM.

[22]  Divesh Srivastava,et al.  Scaling up copy detection , 2015, 2015 IEEE 31st International Conference on Data Engineering.

[23]  Serge Abiteboul,et al.  Corroborating information from disagreeing views , 2010, WSDM '10.

[24]  Xin Dong Solomon: seeking the truth via copying detection , 2011, BEWEB.

[25]  Divesh Srivastava,et al.  Integrating Conflicting Data: The Role of Source Dependence , 2009, Proc. VLDB Endow..

[26]  Laure Berti-Équille,et al.  Truth Discovery Algorithms: An Experimental Evaluation , 2014, ArXiv.

[27]  Philip S. Yu,et al.  Truth Discovery with Multiple Conflicting Information Providers on the Web , 2007, IEEE Transactions on Knowledge and Data Engineering.

[28]  Heng Ji,et al.  Modeling Truth Existence in Truth Discovery , 2015, KDD.

[29]  Heng Ji,et al.  FaitCrowd: Fine Grained Truth Discovery for Crowdsourced Data Aggregation , 2015, KDD.