Online data fusion

The Web contains a significant volume of structured data in various domains, but much of it is dirty or erroneous, and errors can propagate through copying. While data integration techniques allow querying structured data on the Web, they take the union of the answers retrieved from different sources and can thus return conflicting information. Data fusion techniques, on the other hand, aim to find the true values, but they are designed for offline data aggregation and can take a long time. This paper proposes Solaris, the first online data fusion system. It starts by returning answers from the first probed source, and refreshes the answers as it probes more sources and applies fusion techniques to the retrieved data. For each returned answer, it shows the likelihood that the answer is correct, and it stops retrieving data for that answer once it is confident that data from the unprocessed sources are unlikely to change it. We address key problems in building such a system and show empirically that the system can start returning correct answers quickly and terminate fast without sacrificing the quality of the answers.
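The probe, refresh, and stop loop described above can be illustrated with a minimal sketch. This is not the Solaris algorithm: it assumes a simple weighted vote in which each source carries a hypothetical trust weight, it reports the leading value's share of the votes seen so far as a crude stand-in for the likelihood of correctness, and it ignores copying between sources. The function name `online_fusion` and the example data are invented for illustration.

```python
from collections import defaultdict


def online_fusion(sources):
    """Probe sources in order and yield a refreshed answer after each probe.

    `sources` is a list of (weight, value) pairs in probe order, where
    `weight` is an assumed trust score for the source (hypothetical).
    Yields (best_value, vote_share, settled) tuples.
    """
    votes = defaultdict(float)
    remaining = sum(weight for weight, _ in sources)  # weight not yet probed

    for weight, value in sources:
        votes[value] += weight
        remaining -= weight

        ranked = sorted(votes.items(), key=lambda kv: kv[1], reverse=True)
        best_value, best_score = ranked[0]
        runner_up = ranked[1][1] if len(ranked) > 1 else 0.0

        # Share of the votes seen so far: a crude proxy, not a probability
        # produced by a real fusion model.
        vote_share = best_score / sum(votes.values())

        # Early termination: even if every unprobed source voted for the
        # runner-up (or for a brand-new value), the leader would still win.
        settled = best_score - runner_up > remaining

        yield best_value, vote_share, settled
        if settled:
            break


# Hypothetical data: five sources with integer trust weights.
example_sources = [(3, "A"), (1, "B"), (2, "A"), (1, "B"), (1, "C")]

if __name__ == "__main__":
    for answer, share, settled in online_fusion(example_sources):
        print(answer, round(share, 2), "settled" if settled else "refreshing")
```

With the example data the loop stops after the third probe, because the two unprobed sources together carry too little weight to overturn the leading value. A real system would replace the vote share with a probability from its fusion model and would account for source dependence when bounding what the unprobed sources can change.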
