Sailing the Information Ocean with Awareness of Currents: Discovery and Application of Source Dependence

The Web has made a huge amount of useful information available, but it has also made it easier to spread false information and rumors across multiple sources, making it hard to distinguish what is true from what is not. Recent examples include the premature Steve Jobs obituary, the second bankruptcy of United Airlines, and the creation of black holes by the operation of the Large Hadron Collider. Since it is important to permit the expression of dissenting and conflicting opinions, it would be a fallacy to try to ensure that the Web provides only consistent information. However, to help separate the wheat from the chaff, it is essential to be able to determine dependence between sources. Given the huge number of data sources and the vast volume of conflicting data available on the Web, doing so in a scalable manner is extremely challenging and has not yet been addressed by existing work. In this paper, we present a set of research problems and propose some preliminary solutions for discovering dependence between sources. We also discuss how this knowledge can benefit a variety of technologies, such as data integration and Web 2.0, that help users manage and access the totality of the available information from various sources.
