Eliminating NULLs with Subsumption and Complementation

In a data integration process, an important step after schema matching and duplicate detection is data fusion. It is concerned with combining or merging different representations of one real-world object into a single, consistent representation. To resolve potential data conflicts, many different conflict resolution strategies can be applied. In particular, some representations might contain missing values (NULL values) where others provide a non-NULL value. A common strategy for handling such NULL values is to replace them with the existing values from other representations. Thus, the conciseness of the representation is increased without losing information. Two examples of relational operators that implement such a strategy are minimum union and complement union, together with their unary building blocks subsumption and complementation. In this paper, we define and motivate the use of these operators in data integration, consider them as database primitives, and show how to optimize query plans in the presence of subsumption and complementation using rule-based plan transformations.

1 Data Fusion as Part of Data Integration

Data integration can be seen as a three-step process consisting of schema matching, duplicate detection, and data fusion. The first step resolves schematic conflicts, for instance through schema matching and schema mapping techniques. Next, duplicate detection resolves conflicts at the object level, in particular by detecting two or more representations of the same real-world object, called duplicates. For instance, given two data sources describing persons, schema matching determines that the concatenation of the attributes firstname and lastname in Source 1 is semantically equivalent to the attribute name in Source 2. Duplicate detection then recognizes that the entry John M. Smith in Source 1 represents the same person as the entry J. M. Smith in Source 2.
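
To make the NULL-replacement strategy concrete, the following is a minimal Python sketch of the two tuple-level relationships behind subsumption and complementation. It rests on assumptions that this section does not spell out. A tuple is taken to subsume another if it agrees with all of the other's non-NULL values and has strictly fewer NULLs. Two tuples are taken to complement each other if they never disagree on a non-NULL value, share at least one value, and each fills a NULL of the other. The sample rows, the column layout (name, phone, city), and all function names are hypothetical, and `None` stands in for NULL.

```python
from typing import List, Optional, Tuple

# A row is a fixed-length tuple; None plays the role of NULL.
Row = Tuple[Optional[str], ...]


def subsumes(t1: Row, t2: Row) -> bool:
    """True if t1 agrees with every non-NULL value of t2 and has fewer NULLs."""
    if len(t1) != len(t2):
        return False
    agrees = all(b is None or a == b for a, b in zip(t1, t2))
    fewer_nulls = sum(v is None for v in t1) < sum(v is None for v in t2)
    return agrees and fewer_nulls


def remove_subsumed(rows: List[Row]) -> List[Row]:
    """Drop every row that is subsumed by some other row in the input."""
    return [t for t in rows if not any(subsumes(o, t) for o in rows)]


def complements(t1: Row, t2: Row) -> bool:
    """True if t1 and t2 never conflict, overlap somewhere, and each adds a value."""
    if len(t1) != len(t2):
        return False
    pairs = list(zip(t1, t2))
    compatible = all(a is None or b is None or a == b for a, b in pairs)
    shared = any(a is not None and a == b for a, b in pairs)
    t1_adds = any(a is not None and b is None for a, b in pairs)
    t2_adds = any(a is None and b is not None for a, b in pairs)
    return compatible and shared and t1_adds and t2_adds


def merge(t1: Row, t2: Row) -> Row:
    """Combine two complementing rows by filling each NULL from the other row."""
    return tuple(a if a is not None else b for a, b in zip(t1, t2))


# Hypothetical representations of one person: (name, phone, city).
r1: Row = ("John M. Smith", "555-0100", None)
r2: Row = ("John M. Smith", None, None)
r3: Row = ("John M. Smith", None, "Berlin")

kept = remove_subsumed([r1, r2, r3])  # r2 carries no information beyond r1
fused = merge(r1, r3) if complements(r1, r3) else None
```

Here `r2` is removed because `r1` already contains everything it states, while `r1` and `r3` are merged into a single row with no NULLs. The sketch only covers the pairwise case; applying complementation to a whole relation also has to decide which groups of mutually complementing tuples to merge, which is not attempted here.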
Copyright 2011 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.
Bulletin of the IEEE Computer Society Technical Committee on Data Engineering

This article focuses on the step that succeeds both schema matching and duplicate detection, namely data fusion. This final step combines different representations of the same real-world object (previously identified