Interactive and modular design of schema mappings

A central task in information integration is to specify the relationships, called schema mappings, between database schemas. One of the fundamental applications of schema mappings is to specify how data structured under a source schema is to be transformed into data structured under a target schema. Since schemas that occur in real life are typically large and heterogeneous, designing schema mappings is an error-prone, laborious, and time-consuming process. This dissertation studies a novel "divide-and-merge" paradigm for schema mapping creation. The framework allows a design task to be divided into smaller components that are easier to create and understand. Each of the component schema mappings can be designed independently, through an interactive process driven by data examples. To complete the design process, the novel MapMerge schema mapping operator can be used to automatically generate a meaningful overall mapping by correlating the specifications given by the individual mapping components. Specifically, the dissertation explores how to facilitate the design of each schema mapping through data examples and how to assemble the independent schema mappings into a global mapping. To design a schema mapping, the user can provide a set of data examples, each representing a partial specification of the semantics of the desired schema mapping. Based on such a set of data examples, the proposed techniques construct a schema mapping specified by Global-and-Local-As-View (GLAV) constraints that "fits" the data examples, if such a mapping exists. Furthermore, system-generated data examples can be used to guide the user through a schema mapping refinement process that focuses on specific components of a mapping specification, such as the design of grouping semantics or the choice of the desired interpretation in the case of ambiguous mappings.
The flows of independently designed schema mappings can then be automatically orchestrated into larger, semantically richer schema mappings through the novel MapMerge schema mapping operator. The key idea behind MapMerge is the reuse of mapping behavior from more general mappings to more specific mappings. MapMerge allows for the modular construction of complex mappings from various types of smaller mappings, such as schema correspondences produced by a schema matcher or pre-existing mappings designed by a human user, for instance through the proposed techniques based on data examples or via more traditional mapping tools. Experiments show that MapMerge improves the quality of the schema mappings in terms of preserving data associations from the input source instance to the generated target instance. Finally, a novel benchmark was used to assess the relative merits of existing mapping-design systems. The findings of this benchmark on a set of commercial and research systems confirmed the high costs of traditional schema mapping design in terms of time and effort, and thus provided further motivation for the alternative schema mapping design methodology proposed in this dissertation.