Provenance Based Conflict Handling Strategies

A fundamental task in data integration is data fusion, the process of fusing multiple records representing the same real-world object into a consistent representation. Data fusion involves resolving possible conflicts between data coming from different sources; several high-level strategies for handling inconsistent data have been described and classified in [8]. The MOMIS Data Integration System [2] uses both conflict-avoiding strategies (such as the trust your friends strategy, which takes the value of a preferred source) and conflict resolution strategies (such as the meet in the middle strategy, which takes an average value). In this paper we consider other strategies proposed in the literature for handling inconsistent data, and we discuss how they can be adopted and extended in the MOMIS Data Integration System. First, we consider the methods introduced by the Trio system [1,6], which are based on the idea of tackling data conflicts by explicitly including provenance information to represent uncertainty and using it to answer queries. Other possible strategies are to ignore conflicting values at the global level (i.e., only consistent values are considered) or to consider all conflicting values at the global level. The original contribution of this paper is a provenance-based framework that includes all of the above-mentioned conflict handling strategies and uses them as different search strategies for querying the integrated sources.
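To make the strategies named above concrete, the following is a minimal Python sketch of how each one might behave on a single conflicting attribute. It is an illustration under stated assumptions, not the MOMIS or Trio implementation: the function names, the source labels `S1`..`S3`, and the dictionary representation of per-source values are all hypothetical.

```python
from statistics import mean

# Hypothetical input: the value each local source reports for one attribute
# of the same real-world object, keyed by source name (its provenance).

def trust_your_friends(values, preferred):
    """Conflict avoidance: take the value of the preferred source (None if absent)."""
    return values.get(preferred)

def meet_in_the_middle(values):
    """Conflict resolution: take the average of the (numeric) values."""
    return mean(values.values())

def only_consistent(values):
    """Ignore conflicts: return a value only when all sources agree, else None."""
    distinct = set(values.values())
    return distinct.pop() if len(distinct) == 1 else None

def all_values(values):
    """Keep every conflicting value, together with the sources that report it."""
    by_value = {}
    for source, value in values.items():
        by_value.setdefault(value, []).append(source)
    return by_value

# Illustrative data only: three sources disagree on a price.
price = {"S1": 10.0, "S2": 12.0, "S3": 12.0}

print(trust_your_friends(price, "S1"))
print(meet_in_the_middle(price))
print(only_consistent(price))
print(all_values(price))
```

In this sketch the last function is the only one that retains provenance in its result; a provenance-based framework of the kind described here would presumably keep that information available so that a query can select among the strategies at search time.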

[1] Felix Naumann, et al. Data fusion, 2009, CSUR.

[2] Joann J. Ordille, et al. Data integration: the teenage years, 2006, VLDB.

[3] Jennifer Widom, et al. Representing uncertain data: models, properties, and algorithms, 2009, The VLDB Journal.

[4] Domenico Beneventano, et al. Data lineage in the MOMIS data fusion system, 2011, 2011 IEEE 27th International Conference on Data Engineering Workshops.

[5] Jennifer Widom, et al. Lineage tracing for general data warehouse transformations, 2003, The VLDB Journal.

[6] Parag Agrawal, et al. Trio: a system for data, uncertainty, and lineage, 2006, VLDB.

[7] Helmut Seidl, et al. Exact XML Type Checking in Polynomial Time, 2007, ICDT.

[8] Gustavo Alonso, et al. Perm: Processing Provenance and Data on the Same Data Model through Query Rewriting, 2009, 2009 IEEE 25th International Conference on Data Engineering.

[9] Jennifer Widom, et al. An Introduction to ULDBs and the Trio System, 2006, IEEE Data Eng. Bull.

[10] G. Höfner, et al. Data integration, 1993.

[11] Felix Naumann, et al. Completeness of integrated information sources, 2004, Inf. Syst.

[12] Maurizio Lenzerini, et al. Data integration: a theoretical perspective, 2002, PODS.

[13] Chen Li, et al. Information Integration Research: Summary of NSF IDM Workshop Breakout Session, 2004.

[14] Jan Chomicki, et al. Consistent Query Answering: Five Easy Pieces, 2007, ICDT.

[15] Jennifer Widom, et al. Tracing the lineage of view data in a warehousing environment, 2000, TODS.

[16] Maurizio Vincini, et al. Synthesizing an Integrated Ontology, 2003, IEEE Internet Comput.

[17] Silvana Castano, et al. Semantic integration of heterogeneous information sources, 2001, Data Knowl. Eng.

[18] Jennifer Widom, et al. ULDBs: databases with uncertainty and lineage, 2006, VLDB.