Truth Finding on the Deep Web: Is the Problem Solved?

The amount of useful information available on the Web has been growing at a dramatic pace in recent years, and people rely more and more on the Web to fulfill their information needs. In this paper, we study the truthfulness of Deep Web data in two domains where we believed the data to be fairly clean and where data quality is important to people's lives: Stock and Flight. To our surprise, we observed a large amount of inconsistency in data from different sources, as well as some sources with quite low accuracy. We further applied state-of-the-art data fusion methods, which aim at resolving conflicts and finding the truth, to these two data sets, analyzed their strengths and limitations, and suggested promising research directions. We hope our study will increase awareness of the seriousness of conflicting data on the Web and in turn inspire more research in our community to tackle this problem.
