Finding Global Optimum for Truth Discovery: Entropy Based Geometric Variance

Truth discovery is an important problem arising in data-analytics fields such as data mining, databases, and big data. It concerns finding the most trustworthy information in a dataset acquired from a number of unreliable sources. Because of its importance, the problem has been studied extensively in recent years, and a number of techniques have been proposed. However, all of them are heuristic in nature and come with no quality guarantee. In this paper, we formulate the problem as a high-dimensional geometric optimization problem, called Entropy based Geometric Variance. Relying on a number of novel geometric techniques (such as Log-Partition and the Modified Simplex Lemma), we gain new insights into this problem. We show, for the first time, that the truth discovery problem can be solved with a guaranteed quality of solution. In particular, we show that a (1+ε)-approximation can be achieved in nearly linear time under some reasonable assumptions. We expect that our algorithm will be useful for other data-related applications.
