Conflict-Aware Historical Data Fusion

Historical data sources report numerous events over overlapping time intervals, locations, and names. The resulting redundancy can cause severe data conflicts that prevent researchers from obtaining correct answers to queries on an integrated historical database. In this paper, we propose a novel conflict-aware data fusion strategy for historical data sources. We evaluate our approach on a large-scale data warehouse that integrates approximately 50,000 reports of US epidemiological data spanning more than 100 years, and we demonstrate that it significantly reduces data aggregation error in the integrated historical database.
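
To make the aggregation problem concrete, the following minimal sketch shows how reports with overlapping time intervals can inflate a naive sum. It is an illustration only and not the fusion strategy proposed in the paper: the `Report` type, the helper names, the week-based intervals, and the case counts are all hypothetical, and the "prefer the shortest non-overlapping intervals" rule is a simple stand-in heuristic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    """Hypothetical case report for one disease and location."""
    start: int  # first week covered (inclusive)
    end: int    # last week covered (inclusive)
    cases: int  # reported case count for the interval


def overlaps(a: Report, b: Report) -> bool:
    # Two closed intervals overlap when each starts before the other ends.
    return a.start <= b.end and b.start <= a.end


def conflicting_pairs(reports):
    # Every pair of reports whose intervals overlap is a potential double count.
    return [
        (a, b)
        for i, a in enumerate(reports)
        for b in reports[i + 1:]
        if overlaps(a, b)
    ]


def naive_total(reports):
    # Summing all reports ignores redundancy between them.
    return sum(r.cases for r in reports)


def non_overlapping_total(reports):
    # Stand-in heuristic (an assumption, not the paper's method):
    # keep the shortest intervals first and skip any report that
    # overlaps one already kept.
    kept = []
    for r in sorted(reports, key=lambda r: (r.end - r.start, r.start)):
        if not any(overlaps(r, k) for k in kept):
            kept.append(r)
    return sum(r.cases for r in kept)


# Invented data: two weekly reports, one two-week summary of the
# same weeks, and one later weekly report.
reports = [Report(1, 1, 10), Report(2, 2, 15), Report(1, 2, 25), Report(3, 3, 5)]

print(naive_total(reports))             # counts weeks 1-2 twice
print(len(conflicting_pairs(reports)))  # overlapping report pairs
print(non_overlapping_total(reports))   # total after dropping the summary
```

In this invented example the two-week summary repeats the counts of the two weekly reports, so the naive sum overstates the total. Real historical sources rarely agree this neatly, which is why a fusion strategy has to reason about conflicts rather than simply drop overlapping records.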
