Discovering and using semantics for database schemas

This dissertation studies the problem of discovering and using semantics for structured and semi-structured data, such as relational databases and XML documents. Semantics is captured in terms of mappings from a database schema to conceptual schemas or ontologies. Data semantics lies at the heart of data integration, the problem of sharing data across disparate sources. To address this problem, database researchers have proposed a host of solutions, including federated databases, data warehousing, mediator-wrapper-based data integration systems, peer-to-peer data management systems, and, more recently, dataspaces. In the Semantic Web community, the solution to the problem of providing machine-understandable data for better web-wide information retrieval and exchange is to annotate web data using formal domain ontologies. A central issue in all of these solutions is capturing the semantics of the data to be integrated.

This dissertation describes our solutions for discovering the semantics of data and for using those semantics to facilitate the discovery of schema mappings. First, we develop a semi-automatic tool, MAPONTO, for discovering semantics for a database schema in terms of a given conceptual model (hereafter CM). The tool takes as inputs a relational or XML database schema, a CM covering the same domain as the database, and a set of simple element correspondences from schema elements to datatype properties in the CM. It then generates a set of logical formulas that define a mapping from the schema to the CM. The key is to align the integrity constraints in the schema with the semantic constructs in the CM, guided by standard database design principles. Second, we extend MAPONTO with a semantic approach to finding schema mapping expressions. The approach leverages the semantics of schemas expressed in terms of CMs.

We present experimental results demonstrating that MAPONTO saves significant human effort in discovering the semantics of database schemas and that it outperforms traditional mapping techniques for building complex schema mapping expressions in terms of both recall and precision. The development of MAPONTO provides a suite of practical tools for recovering semantics for database-resident data and for generating improved schema mapping results for data integration.
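To make the inputs and output described above concrete, the following is a minimal sketch and not the MAPONTO algorithm itself. It assumes a toy CM given as a list of object-property edges, a single hypothetical table `emp(eid, ename, dname)`, and hand-written correspondences from columns to datatype properties; every name in it is invented for illustration. It connects the CM classes touched by the correspondences with a shortest path and prints one conjunctive formula. It deliberately ignores keys and foreign keys, whose alignment with CM constructs the abstract identifies as the key step.

```python
from collections import deque

# Hypothetical CM: (source class, object property, target class).
CM_EDGES = [("Employee", "worksIn", "Department")]

# Hypothetical correspondences: column -> (CM class, datatype property).
CORRESPONDENCES = {
    "eid": ("Employee", "hasId"),
    "ename": ("Employee", "hasName"),
    "dname": ("Department", "hasDeptName"),
}


def connecting_path(edges, start, goal):
    """Breadth-first search for a shortest chain of CM edges from start to goal."""
    adjacency = {}
    for src, prop, dst in edges:
        adjacency.setdefault(src, []).append((prop, dst))
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for prop, nxt in sorted(adjacency.get(node, [])):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(node, prop, nxt)]))
    return None


def mapping_formula(table, columns, correspondences, edges, anchor):
    """Build one conjunctive formula relating a table to CM classes and properties."""
    variables, atoms = {}, []

    def var_for(cls):
        # Allocate a fresh variable and a class atom the first time a class appears.
        if cls not in variables:
            variables[cls] = f"x{len(variables) + 1}"
            atoms.append(f"{cls}({variables[cls]})")
        return variables[cls]

    var_for(anchor)
    for col in columns:
        cls, prop = correspondences[col]
        if cls not in variables:
            path = connecting_path(edges, anchor, cls)
            if path is None:
                raise ValueError(f"no CM path from {anchor} to {cls}")
            for src, obj_prop, dst in path:
                atom = f"{obj_prop}({var_for(src)}, {var_for(dst)})"
                if atom not in atoms:
                    atoms.append(atom)
        atoms.append(f"{prop}({variables[cls]}, {col})")
    return f"{table}({', '.join(columns)}) -> " + " & ".join(atoms)


if __name__ == "__main__":
    print(mapping_formula("emp", ["eid", "ename", "dname"],
                          CORRESPONDENCES, CM_EDGES, anchor="Employee"))
```

The choice of `Employee` as the anchor class is supplied by hand here. A real tool would have to derive such choices from the schema's constraints, and would typically need to rank several candidate paths rather than take the first shortest one.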
