Data Integration

Data integration aims at combining data that resides in distributed, autonomous, and heterogeneous databases into a single, consistent view of the data. Its applications are abundant, ranging from the integration of scientific data (helping scientists share, understand, reuse, and complement past results) to data integration in enterprises (for instance, to set up data warehouses, perform business intelligence, or implement master data management) and data integration on the Web (online comparison shopping or linking open data). To achieve data integration, three major problems have been of particular interest to both the database research community and the IT industry. First, the heterogeneity between the data models and schemas of the data sources has to be overcome. Second, data sources may overlap in the sets of real-world entities they represent, such as persons or products, and the multiple, usually differing representations of the same entity, so-called duplicates, need to be identified. Finally, in the integrated result, every entity should be represented exactly once, so duplicates need to be merged into a single representation, a problem also referred to as data fusion. This special issue covers some solutions and new challenges related to data integration, from both a research and an industrial perspective.

The article by Mecca and Papotti describes the state of the art of schema mapping and data exchange solutions, which address the first of the above steps by bridging the heterogeneity between the data models and schemas of the data sources to be integrated. Schema mapping techniques have acquired great popularity due to their declarative nature, clean semantics, easy-to-use design tools, and their efficiency and modularity in the deployment step. The article divides the surveyed approaches into three ages: the heroic age, which produced the theoretical foundations and early tools; the silver age, in which schema mapping tools grew into complex systems and found their way into both commercial and open-source products; and a forthcoming golden age with novel research opportunities and a new generation of systems capable of dealing with a significantly larger class of real-life applications.

Maier, Oberhofer and Schwarz describe a commercial approach to data integration that addresses the three main problems above (among others) and that has been used in large data integration projects worldwide. This approach is based on the observation that the largest part of a typical data integration effort is dedicated to implementing transformation, cleansing, and data validation logic in robust and highly performing commercial systems. This work is simple and does not demand skills beyond commercial product knowledge, but it is very labour-intensive and error-prone. Their approach helps to industrialize data integration projects and significantly lowers the amount of simple but labour-intensive work. The key idea is that the target landscape of a data integration project has pre-defined data models and associated metadata, which can be leveraged to build and automate the data integration process.

In her article on multi-scale data integration, Berti-Equille presents some challenging research directions for integrating massive, multi-scale scientific data from the observational science domain. This data is intensively collected in order to measure various properties of the Earth; for instance, scientists observe environmental conditions, ecosystems, or biological species. The ability to understand complex phenomena such as global warming and to predict trends from spatio-temporal data has become a major issue in observational science, for which theoretical and technical advances in multi-scale data integration are essential. The paper describes several use cases of data integration in the observational sciences and outlines challenges due to temporal, spatial, structural, semantic, and analytic dependencies; different levels of data granularity and of data abstraction, from raw measurements to processed data and derived statistics; varying data interpretations and usages across disciplines; the heterogeneous quality of spatio-temporal data; and scaling issues.