Information Brokering Across Heterogeneous Digital Data

Abstraction Level Incompatibility. In this case, heterogeneities arise from the differing levels of abstraction at which the entities may be represented (Section 1.4). These heterogeneities can be resolved by mapping the entities to concepts at appropriate levels of abstraction in the domain ontology. If no appropriate concepts exist, one may have to construct c-contexts from concepts in the domain ontology. If the concepts come from different ontologies, one may have to define terminological relationships across them. Examples of terminological relationships are hyponyms/hypernyms in the case of generalization/specialization and holonyms/meronyms in the case of aggregations.

Schematic Discrepancies. In this case, heterogeneities arise when data in one database corresponds to metadata in another (Section 1.5). One form of this heterogeneity is the attribute-entity conflict. It can be resolved by mapping corresponding entities and attributes to appropriate c-contexts. Other forms of this heterogeneity require a mechanism to specify correspondences between data in one database and metadata in another, which is beyond the scope of this book.

We have discussed above how schematic details and heterogeneities can be abstracted out by using c-contexts associated with mapping expressions and transformer functions. The c-contexts so constructed are also used to capture the information content. From the perspective of information brokering, they may also be viewed as an intermediate language in which the information content of the underlying databases is represented. The c-contexts may be constructed from two perspectives, as follows.

Bottom-Up Perspective. In this case, the focus is on abstracting out the representational and schematic details.
Thus, c-contexts are used as views on objects in the underlying databases, and the set of instances exported to the information broker on the GII obeys the view constraints. This is the perspective primarily followed in (Kashyap and Sheth, 1996).

Top-Down Perspective. In this case, the focus is on modeling and specifying information in an application-specific or domain-specific manner. Thus, it is assumed that there exist underlying objects in the databases for concepts in the ontologies. Mappings are then appropriately combined to determine the object instances in the underlying databases that satisfy the constraints specified in the c-contexts. This perspective is taken in (Mena et al., 1996b). A similar perspective has been taken in (Borgida and Brachman, 1993) for populating description logic (DL) expressions.

2.2 C-CONTEXTS: A PARTIAL REPRESENTATION

Several efforts attempt to represent the similarity between two objects in databases. In (Larson et al., 1989), a fixed set of descriptors defines essential characteristics of attributes and is used to generate mappings between them. We have discussed in (Kashyap and Sheth, 1996) how these descriptors do not guarantee semantic similarity. Thus, any representation of a c-context that can be described by a fixed set of descriptors is not appropriate. In our approach, the descriptors (or meta-attributes) are chosen dynamically to model characteristics of the application domain. It is not possible to determine a priori all the meta-attributes that would completely characterize the semantics of an application domain. This leads to a partial representation of c-contexts. We represent a c-context as a collection of contextual coordinates (meta-attributes) as follows:

Context = <(C1, Expr1) (C2, Expr2) ... (Ck, Exprk)>

where

■ Ci, 1 ≤ i ≤ k, is a contextual coordinate denoting an aspect of a c-context.
■ Ci may model some characteristic of the subject domain and may be obtained from a domain specific ontology (discussed later in this section).
■ Ci may model an implicit assumption in the design of a database.

Capturing Information Content in Structured Data 109

Now, we explain the meaning of the symbols Ci and Expri by using examples and by enumerating the corresponding DL expressions. When using DL expressions, it is possible to define primitive classes and, in addition, to specify classes using intensional descriptions phrased in terms of necessary and sufficient properties that must be satisfied by their instances. The intensional descriptions may be used to express the collection of constraints that make up a c-context. Using the terminology of DL systems, each term may be modeled as either a concept or a role. Each Ci roughly corresponds to a role, and each Expri roughly corresponds to the fillers for that role. Expri might be a term, a c-context, or a term associated with a c-context. Heuristics for modeling terms as contextual coordinates or as their values are discussed later in this section. The DL expressions corresponding to c-contexts are summarized in Appendix 5.A.

We use the following example and terminology to explain how c-contexts capture information in the databases using terms from a domain ontology. Consider the following database objects:

EMPLOYEE(SS#, Name, SalaryType, Dept, Affiliation)
PUBLICATION(Id, Title, Journal)
POSITION(Id, Title, Dept, Type)
HAS-PUBLICATION(SS#, Id)
HOLDS-POSITION(SS#, Id)

Let us now illustrate with examples how the information content in these database objects can be captured with the help of terms organized as c-contexts in a domain specific ontology. Some relevant terminology is as follows.

■ term(O) and term(A) are terms corresponding to the database object O and attribute A at the intensional level.
We assume the existence of transformer functions between the domains of the terms (also referred to as the extension of the term) in the ontology, and the domains of the appropriate object or attribute in the database.
■ instance(V) is the instance corresponding to the data value V in the database. The data value might be a key or an object identifier. This might be implemented using a transformer function between the domain of the term to which the instance belongs in the ontology and the domain of the appropriate object or attribute in the database.
■ Ext(Term) denotes the set of instances corresponding to the term in the ontology.
The predicate term should have one more argument identifying the ontology being used, since a database might contain information in more than one information domain. However, we can assume without loss of generality that a single ontology is used to capture the information in this database.
■ Cdef(O) is the definition context of a database object O and is typically used to specify assumptions in the design of the object. It may also be used to share a pre-determined extension of the object with the GII (denoted as OG).
■ O1 o Cass(O1, O2) denotes the association of an object O1 with an association context. This may be used to represent relationships between the objects O1 and O2 with reference to an aspect of the application domain.
■ Cq denotes the context associated with a query Q posed to an information broker on the GII. The context makes explicit (partially) the semantics of the query. A user can consult concepts in ontologies and objects in a database to construct the query context.
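As a rough illustration of the terminology above, the following Python sketch models term, instance, and Ext over a tiny in-memory stand-in for the EMPLOYEE object. The term names are the ones the text associates with EMPLOYEE, EMPLOYEE.SS#, and EMPLOYEE.Name. Everything else is an assumption made for illustration: the sample rows, the `ssn_to_instance` transformer function, the `empl:` identifier prefix, and the dictionary layout. The source only assumes that such transformer functions exist, not how they are implemented.

```python
from typing import Callable, Dict, List, Set

# Hypothetical sample data standing in for the EMPLOYEE database object.
TABLES: Dict[str, List[Dict[str, str]]] = {
    "EMPLOYEE": [
        {"SS#": "111-22-3333", "Name": "A. Rao", "Dept": "CS"},
        {"SS#": "444-55-6666", "Name": "B. Lee", "Dept": "EE"},
    ],
}

# Key attribute of each object (assumed here to be SS# for EMPLOYEE).
KEY_ATTRIBUTE: Dict[str, str] = {"EMPLOYEE": "SS#"}

# term(O) and term(A): ontology terms for database objects and attributes.
TERM: Dict[str, str] = {
    "EMPLOYEE": "EmplConcept",
    "EMPLOYEE.SS#": "EmplConcept.self",
    "EMPLOYEE.Name": "name",
}


def ssn_to_instance(value: str) -> str:
    """Assumed transformer function: database key -> ontology instance id."""
    return "empl:" + value.replace("-", "")


# One transformer function per attribute whose values denote instances.
TRANSFORMERS: Dict[str, Callable[[str], str]] = {
    "EMPLOYEE.SS#": ssn_to_instance,
}


def term(name: str) -> str:
    """term(O) or term(A): look up the ontology term for an object/attribute."""
    return TERM[name]


def instance(attribute: str, value: str) -> str:
    """instance(V): map a data value to the corresponding ontology instance."""
    return TRANSFORMERS[attribute](value)


def ext(term_name: str) -> Set[str]:
    """Ext(Term): instances of a term, derived from the mapped objects."""
    result: Set[str] = set()
    for obj, rows in TABLES.items():
        if TERM.get(obj) == term_name:
            key = KEY_ATTRIBUTE[obj]
            result |= {instance(f"{obj}.{key}", row[key]) for row in rows}
    return result
```

In this sketch a single ontology is assumed, matching the simplification made in the text; supporting several ontologies would mean adding an ontology argument to `term`.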
For a detailed exposition of the various types of context, see (Kashyap, 1997).

We can identify the following associations:

term(EMPLOYEE) = EmplConcept
term(EMPLOYEE.SS#) = EmplConcept.self
term(EMPLOYEE.Name) = name
term(EMPLOYEE.Dept) = hasEmployer
term(EMPLOYEE.Affiliation) = hasAffiliation
term(PUBLICATION) = PublConcept
term(PUBLICATION.Id) = { hasArticle, PublConcept.self }
term(PUBLICATION.Title) = hasTitle
term(POSITION) = PostConcept
term(POSITION.Id) = { hasPosition, PostConcept.self }
term(HAS-PUBLICATION) = HasPublConcept
term(HAS-PUBLICATION.Id) = { hasArticle, isAuthorOf }
term(HAS-PUBLICATION.SS#) = hasAuthor
term(HOLDS-POSITION) = HoldsPostConcept
term(HOLDS-POSITION.SS#) = hasDesignee
term(HOLDS-POSITION.Id) = { hasPosition, isDesigneeOf }

The value Expri of a contextual coordinate Ci can be represented in the following manner.

■ Expri can be a variable. It is used as a placeholder to elicit answers from the databases and to impose constraints on them.

Example: Suppose we are interested in people who are authors and who hold a position (designees). We can represent the query context Cq as follows:

Cq = <(isAuthorOf, X) (isDesigneeOf, Y)>

The same thing can be expressed in a DL as follows:

Cq = (AND Anything (ATLEAST 1 isAuthorOf) (ATLEAST 1 isDesigneeOf))

The terms isAuthorOf and isDesigneeOf are obtained from a domain specific ontology. From a modeling perspective, the above query expresses the user's interest in all employees who hold a position and have authored a published article. In this particular case, it can be seen intuitively that objects that are instances of EmplConcept are the right candidates. This can be expressed in the following manner:

Cq = (AND EmplConcept (ATLEAST 1 isAuthorOf) (ATLEAST 1 isDesigneeOf))

Note that we use variables in a very restricted manner, for the specific purpose of retrieving relevant properties of the selected objects.
They are used only at the highest level of nesting, though c-contexts can have an arbitrary level of nesting (since each Expri can be a c-context or a term associated with a c-context), and hence we do not need to perform complex nested unifications.

■ Expri can be a set.
– The set may be an enumeration of terms from a domain specific ontology.
– The set may be defined as the extension of an object or as elements from the domain of a type defined in the database.
– The set may be defined by posing constraints on pre-existing sets.
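The variable-valued query context discussed above can be sketched in Python as a list of (coordinate, expression) pairs, together with a translation into the DL form quoted in the text. This is a minimal sketch under stated assumptions: the names `Var` and `to_dl` are hypothetical, and only variable-valued coordinates are handled, because the DL forms for other kinds of Expri are summarized in Appendix 5.A rather than in this passage.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Var:
    """A variable used as the value Expri of a contextual coordinate."""
    name: str


# A c-context as an ordered collection of (Ci, Expri) pairs.
# Only Var values are supported in this sketch.
Context = List[Tuple[str, object]]


def to_dl(context: Context, root: str = "Anything") -> str:
    """Translate a variable-only c-context into a DL expression string.

    Each (Ci, variable) pair becomes (ATLEAST 1 Ci), following the
    example in the text; `root` is the concept being constrained.
    """
    parts = [root]
    for coordinate, expr in context:
        if not isinstance(expr, Var):
            # Sets, nested c-contexts, etc. are outside this sketch.
            raise NotImplementedError(
                f"no DL translation sketched for value of {coordinate}"
            )
        parts.append(f"(ATLEAST 1 {coordinate})")
    return "(AND " + " ".join(parts) + ")"


# Query context Cq = <(isAuthorOf, X) (isDesigneeOf, Y)>
cq: Context = [("isAuthorOf", Var("X")), ("isDesigneeOf", Var("Y"))]

print(to_dl(cq))
print(to_dl(cq, root="EmplConcept"))
```

Because variables appear only at the top level of the c-context, the translation is a single pass over the coordinate list and needs no nested unification.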

[1]  C. M. Sperberg-McQueen,et al.  eXtensible Markup Language (XML) 1.0 (Second Edition) , 2000 .

[2]  Vipul Kashyap,et al.  InfoSleuth: Semantic Integration of Information in Open and Dynamic Environments (Experience Paper) , 1997, SIGMOD Conference.

[3]  Amit Sheth,et al.  Semantic and schematic similarities between database objects: a context-based approach , 1996, VLDB 1996.

[4]  Anant Jhingran A Performance Study of Query Optimization Algorithms on a Database System Supporting Procedures , 1988, VLDB.

[5]  Eric N. Hanson Processing queries against database procedures: a performance analysis , 1988, SIGMOD '88.

[6]  Dennis McLeod,et al.  A federated architecture for information management , 1985, TOIS.

[7]  Michael Stonebraker,et al.  EXTENDING A DATA BASE SYSTEM WITH PROCEDURES , 1985 .

[8]  Matthias Klusch,et al.  Intelligent Information Agents , 1999, Springer Berlin Heidelberg.

[9]  Jungyun Seo,et al.  Classifying schematic and data heterogeneity in multidatabase systems , 1991, Computer.

[10]  Vipul Kashyap,et al.  Semantic Information Brokering: How Can a Multi-agent Approach Help? , 1999, CIA.

[11]  Marti A. Hearst,et al.  Metadata for mixed-media access , 1994, SGMD.

[12]  Joann J. Ordille,et al.  Querying Heterogeneous Information Sources Using Source Descriptions , 1996, VLDB.

[13]  Thomas R. Gruber,et al.  A translation approach to portable ontology specifications , 1993, Knowl. Acquis..

[14]  James A. Larson,et al.  A Theory of Attribute Equivalence in Databases with Application to Schema Integration , 1989, IEEE Trans. Software Eng..

[15]  John McCarthy,et al.  Notes on Formalizing Context , 1993, IJCAI.

[16]  Ravi Krishnamurthy,et al.  Language features for interoperability of databases with schematic discrepancies , 1991, SIGMOD '91.

[17]  Peter Schäuble,et al.  Metadata for integrating speech documents in a text retrieval system , 1994, SGMD.

[18]  David W. Embley,et al.  An approach to schema integration and query formulation in federated database systems , 1987, 1987 IEEE Third International Conference on Data Engineering.

[19]  Richard A. Harshman,et al.  Indexing by latent semantic indexing , 1990 .

[20]  Dennis McLeod,et al.  An Approach to Resolving Semantic Heterogenity in a Federation of Autonomous, Heterogeneous Database Systems , 1993, Int. J. Cooperative Inf. Syst..

[21]  Klemens Böhm,et al.  Metadata for multimedia documents , 1994, SGMD.

[22]  Roger C. Schank,et al.  The Primitive ACTs of Conceptual Dependency , 1975, TINLAP.

[23]  Scott Piepenburg,et al.  Easy Marc: A Simplified Guide to Creating Catalog Records for Library Automation Systems : Pre-Format Integration , 1994 .

[24]  Alexander Borgida From Type Systems to Knowledge Representation: Natural Semantics Specifications for Description Logics , 1992, Int. J. Cooperative Inf. Syst..

[25]  George A. Miller,et al.  WordNet: A Lexical Database for English , 1995, HLT.

[26]  Roger C. Schank,et al.  Conceptual dependency: A theory of natural language understanding , 1972 .

[27]  Umeshwar Dayal,et al.  View Definition and Generalization for Database Integration in a Multidatabase System , 1984, IEEE Transactions on Software Engineering.

[28]  Craig A. Knoblock,et al.  Retrieving and Integrating Data from Multiple Information Sources , 1993, Int. J. Cooperative Inf. Syst..

[29]  Dan Brickley,et al.  Resource Description Framework (RDF) Model and Syntax Specification , 2002 .

[30]  Yuri Breitbart,et al.  Multidatabase Interoperability , 1990, SGMD.

[31]  Serge Abiteboul,et al.  Querying Semi-Structured Data , 1997, Encyclopedia of Database Systems.

[32]  Wesley W. Chu,et al.  A Knowledge-Based Approach for Retrieving Images by Content , 1996 .

[33]  Vipul Kashyap,et al.  InfoHarness: Use of Automatically Generated Metadata for Search and Retrieval of Heterogeneous Information , 1995, CAiSE.

[34]  MAGIC' An Interface for Generating Mapping , .

[35]  Ramesh Jain,et al.  Infoscopes: Multimedia Information Systems , 1996 .

[36]  Kevin Chen-Chuan Chang,et al.  Interoperability for digital libraries worldwide , 1998, CACM.

[37]  A. Sheth,et al.  Information Brokering Across Heterogeneous Digital Data , 2000, Advances in Database Systems.

[38]  Amit P. Sheth,et al.  Using Polytransactions to Manage Interdependent Data , 1992, Database Transaction Models for Advanced Applications.

[39]  Wei Shen,et al.  Enterprise information modeling and model integration in carnot , 1992 .

[40]  Peter M. Schwarz,et al.  The Rufus System: Information Organization for Semi-Structured Data , 1993, VLDB.

[41]  Brewster Kahle,et al.  An information system for corporate users: wide area information servers , 1991 .

[42]  Gio Wiederhold,et al.  Intelligent integration of information , 1993, SIGMOD Conference.

[43]  Amit P. Sheth,et al.  Semantic interoperability in global information systems , 1999, SGMD.

[44]  R. Guha Contexts: a formalization and some applications , 1992 .

[45]  S. Misbah Deen,et al.  The Architecture of a Generalised Distributed Database System - PRECI , 1985, Comput. J..

[46]  Michael Stonebraker,et al.  On rules, procedures, caching and views in database systems , 1994, SIGMOD 1994.

[47]  Yuri Breitbart,et al.  Database integration in a distributed heterogeneous database system , 1986, 1986 IEEE Second International Conference on Data Engineering.

[48]  Michael Stonebraker,et al.  The design of POSTGRES , 1986, SIGMOD '86.

[49]  Robert M. MacGregor,et al.  A Deductive Pattern Matcher , 1988, AAAI.

[50]  Vipul Kashyap,et al.  So Far (Schematically) yet So Near (Semantically) , 1992, DS-5.

[51]  Yasushi Kiyoki,et al.  A metadatabase system for semantic image search by a mathematical model of meaning , 1994, SGMD.

[52]  Franz Baader,et al.  KRIS: Knowledge Representation and Inference System , 1991, SGAR.

[53]  Vipul Kashyap,et al.  Managing Multiple Information Sources through Ontologies: Relationship between Vocabulary Heterogeneity and Loss of Information , 1996, KRDB.

[54]  A. Illarramendi Connecting Knowledge Bases with Databases: A complete mapping relation , 1995 .

[55]  Deborah L. McGuinness,et al.  CLASSIC: a structural data model for objects , 1989, SIGMOD '89.

[56]  Gerald Salton,et al.  Automatic text processing , 1988 .

[57]  Vipul Kashyap,et al.  Semantics-based information brokering , 1994, CIKM '94.

[58]  Vipul Kashyap,et al.  Attribute-based Access of Heterogeneous Digital Data , 1995 .

[59]  Christine Collet,et al.  Resource integration using a large knowledge base in Carnot , 1991, Computer.

[60]  Ramanathan V. Guha,et al.  Building Large Knowledge-Based Systems: Representation and Inference in the Cyc Project , 1990 .

[61]  Amit P. Sheth,et al.  Overview on Using Metadata to manage Multimedia Data , 1998, Multimedia Data Management.

[62]  Amit P. Sheth,et al.  Specifying interdatabase dependencies in a multidatabase environment , 1991, Computer.

[63]  Vipul Kashyap,et al.  Media-independent correlation of Information: What? How? , 1996, MD.

[64]  Vipul Kashyap,et al.  Domain Specific Ontologies for Semantic Information Brokering on the Global Information Infrastructure , 1998 .

[65]  Timos K. Sellis Efficiently supporting procedures in relational database systems , 1987, SIGMOD '87.

[66]  Terry E. Weymouth,et al.  Semantic Queries with Pictures: The VIMSYS Model , 1991, VLDB.

[67]  Gio Wiederhold,et al.  Mediators in the architecture of future information systems , 1992, Computer.

[68]  Munindar P. Singh,et al.  Readings in agents , 1997 .

[69]  Ronald J. Brachman,et al.  An Overview of the KL-ONE Knowledge Representation System , 1985, Cogn. Sci..

[70]  Joann J. Ordille,et al.  Distributed active catalogs and meta-data caching in descriptive name services , 1993, [1993] Proceedings. The 13th International Conference on Distributed Computing Systems.

[71]  Alexander Borgida,et al.  Loading data into description reasoners , 1993, SIGMOD Conference.

[72]  Nicola Guarino,et al.  The Ontological Level , 1994 .

[73]  Clement T. Yu,et al.  Determining relationships among attributes for interoperability of multi-database systems , 1991, [1991] Proceedings. First International Workshop on Interoperability in Multidatabase Systems.