Providing integrated access to multiple heterogeneous sources is a challenging issue in global information systems for cooperation and interoperability. In this context, two fundamental problems arise. The first is how to determine whether the sources contain semantically related information, that is, information related to the same or similar real-world concept(s). The second is how to handle semantic heterogeneity to support integration and uniform query interfaces. Compared with conventional view integration techniques, two factors complicate the task: the sources to be integrated already exist, and semantic heterogeneity occurs on a large scale, involving the terminology, structure, and context of the sources with respect to geographical, organizational, and functional aspects of information use. Moreover, to meet the requirements of global, Internet-based information systems, the tools developed to support these activities should be as semi-automatic and scalable as possible.
The goal of this paper is to describe the MOMIS [4, 5] (Mediator envirOnment for Multiple Information Sources) approach to the integration and querying of multiple, heterogeneous information sources containing structured and semistructured data. MOMIS has been conceived as a collaboration between the Universities of Milano and Modena in the framework of the INTERDATA national research project, which aims at providing methods and tools for data management in Internet-based information systems. Like other integration projects [1, 10, 14], MOMIS follows a “semantic approach” to information integration based on the conceptual schema, or metadata, of the information sources, and on the following architectural elements: i) a common object-oriented data model, defined according to the ODL<subscrpt><italic>I</italic><supscrpt>3</supscrpt></subscrpt> language, to describe source schemas for integration purposes (the data model and ODL<subscrpt><italic>I</italic><supscrpt>3</supscrpt></subscrpt> have been defined in MOMIS as a subset of the ODMG-93 ones, following the proposal for a standard mediator language developed by the <italic>I</italic><supscrpt>3</supscrpt>/POB working group [7], and ODL<subscrpt><italic>I</italic><supscrpt>3</supscrpt></subscrpt> additionally introduces new constructors to support the semantic integration process [4, 5]); ii) one or more wrappers, to translate schema descriptions into the common ODL<subscrpt><italic>I</italic><supscrpt>3</supscrpt></subscrpt> representation; iii) a mediator and a query-processing component, based on two pre-existing tools, namely ARTEMIS [8] and ODB-Tools [3] (available on the Internet at http://sparc20.dsi.unimo.it/), to provide an <italic>I</italic><supscrpt>3</supscrpt> architecture for integration and query optimization. In this paper, we focus on capturing and reasoning about semantic aspects of the schema descriptions of heterogeneous information sources to support integration and query optimization.
Both semistructured and structured data sources are taken into account [5]. A Common Thesaurus is constructed to serve as a shared ontology for the information sources. It is built by analyzing the ODL<subscrpt><italic>I</italic><supscrpt>3</supscrpt></subscrpt> descriptions of the sources using the description logic OLCD (Object Language with Complements allowing Descriptive cycles) [2, 6], derived from the KL-ONE family [17]. The knowledge in the Common Thesaurus is then exploited to identify semantically related information in the ODL<subscrpt><italic>I</italic><supscrpt>3</supscrpt></subscrpt> descriptions of different sources and to integrate them at the global level. Mapping rules and integrity constraints are defined at the global level to express the relationships holding between the integrated description and the source descriptions. ODB-Tools, which supports OLCD and description logic inference techniques, allows the source descriptions to be analyzed in order to generate a consistent Common Thesaurus, and it supports semantic optimization of queries at the global level, based on the defined mapping rules and integrity constraints.
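As a rough illustration of how a shared thesaurus can be used to find semantically related terms across sources, the following Python sketch stores relationships between source-qualified schema terms and returns everything connected to a given term. It is only a toy model under stated assumptions: the class `CommonThesaurus`, its methods, the source names `S1` to `S3`, and the term names are all hypothetical, relationships are treated as untyped and transitive, and none of the OLCD inference performed by ODB-Tools is reproduced here.

```python
from collections import defaultdict, deque


class CommonThesaurus:
    """Toy shared vocabulary: an undirected graph over (source, name) terms.

    Hypothetical illustration only; it does not model OLCD inference.
    """

    def __init__(self):
        # Maps each term to the set of terms it is directly related to.
        self._links = defaultdict(set)

    def relate(self, term_a, term_b):
        """Record that two schema terms denote related real-world concepts."""
        self._links[term_a].add(term_b)
        self._links[term_b].add(term_a)

    def related(self, term):
        """Return every term reachable from `term`, excluding `term` itself.

        Assumption: relatedness is transitive, so a breadth-first search
        over the recorded links collects the whole connected group.
        """
        seen = {term}
        queue = deque([term])
        while queue:
            current = queue.popleft()
            # .get avoids creating entries for terms that were never recorded.
            for neighbour in self._links.get(current, ()):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        seen.discard(term)
        return seen


if __name__ == "__main__":
    thesaurus = CommonThesaurus()
    # Hypothetical class names taken from three hypothetical sources.
    thesaurus.relate(("S1", "Professor"), ("S2", "Faculty"))
    thesaurus.relate(("S2", "Faculty"), ("S3", "Lecturer"))
    print(sorted(thesaurus.related(("S1", "Professor"))))
```

In this sketch, terms grouped by `related` would be the candidates for integration into a single global class. A real mediator would also have to distinguish kinds of relationship and check the result for consistency, which the sketch deliberately leaves out.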
[1] Craig A. Knoblock et al. Retrieving and Integrating Data from Multiple Information Sources. Int. J. Cooperative Inf. Syst., 1993.
[2] James G. Schmolze et al. The KL-ONE family. 1992.
[3] Silvana Castano et al. Semantic dictionary design for database interoperability. Proceedings 13th International Conference on Data Engineering, 1997.
[4] Richard Hull et al. Managing semantic heterogeneity in databases: a theoretical prospective. PODS, 1997.
[5] Louiqa Raschid et al. Mediator languages—a proposal for a standard: report of an I3/POB working group held at the University of Maryland, April 12 and 13, 1996. SGMD, 1997.
[6] Domenico Beneventano et al. Consistency Checking in Complex Object Database Schemata with Integrity Constraints. DBPL, 1995.
[7] George A. Miller et al. WordNet: A Lexical Database for English. HLT, 1995.
[8] Maurizio Vincini et al. ODB-Tools: A Description Logics Based Tool for Schema Validation and Semantic Query Optimization in Object Oriented Databases. AI*IA, 1997.
[9] Louiqa Raschid et al. Mediator Languages - a Proposal for a Standard. SIGMOD Rec., 1997.
[10] Silvana Castano et al. An intelligent approach to information integration. 1998.
[11] Vipul Kashyap et al. Domain Specific Ontologies for Semantic Information Brokering on the Global Information Infrastructure. 1998.
[12] Mary Roth et al. Don't Scrap It, Wrap It! A Wrapper Architecture for Legacy Data Sources. VLDB, 1997.
[13] Jeffrey D. Ullman et al. Information integration using logical views. Theor. Comput. Sci., 1997.
[14] Joann J. Ordille et al. Querying Heterogeneous Information Sources Using Source Descriptions. VLDB, 1996.