Conceptual schema matching based on similarity heuristics

Leme, Luiz Andre Portes Paes; Casanova, Marco Antonio. Conceptual schema matching based on similarity heuristics. Rio de Janeiro, 2009. 106p. DSc Thesis – Department of Informatics – Pontifical Catholic University of Rio de Janeiro.

Schema matching is a fundamental issue in many database applications, such as query mediation, database integration, catalogue matching and data warehousing.

In this thesis, we first address how to match catalogue schemas. A catalogue is a simple database that holds information about a set of objects, typically classified using terms taken from a given thesaurus. We introduce a matching approach, based on the notion of similarity, which applies to pairs of thesauri and to pairs of lists of properties. We then describe matchings based on co-occurrence of information and introduce variations that explore certain heuristics. Lastly, we discuss experimental results that evaluate the precision of the matchings introduced and that measure the influence of the heuristics.

We then focus on the more complex problem of matching two schemas that belong to an expressive OWL dialect. We adopt an instance-based approach and, therefore, assume that a set of instances from each schema is available. We first decompose the problem of OWL schema matching into the problem of vocabulary matching and the problem of concept mapping. We also introduce sufficient conditions guaranteeing that a vocabulary matching induces a correct concept mapping. Next, we describe an OWL schema matching technique based on the notion of similarity. Finally, we evaluate the precision of the technique using data available on the Web.

Unlike any of the previous instance-based techniques, the matching process we describe uses similarity functions to induce vocabulary matchings in a non-trivial way, coping with an expressive OWL dialect. We also illustrate, through a set of examples, that the structure of OWL schemas may lead to incorrect concept mappings and indicate how to avoid such pitfalls.
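
The general idea of instance-based matching by similarity can be illustrated with a minimal Python sketch. It is not the technique developed in the thesis: the choice of Jaccard similarity, the 0.5 threshold, the function names `jaccard` and `match_properties`, and the toy catalogues are all assumptions made for illustration only. Each property is represented by the set of values observed in its instances, and a property of one catalogue is matched to the most similar property of the other.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of observed values."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen here: two empty value sets do not match
    return len(a & b) / len(a | b)


def match_properties(source, target, threshold=0.5):
    """Match each source property to its most similar target property.

    `source` and `target` map property names to lists of instance values.
    A pair is kept only if its similarity reaches `threshold`
    (an illustrative cut-off, not a value taken from the thesis).
    """
    matches = {}
    for s_name, s_values in source.items():
        best_name, best_score = None, 0.0
        for t_name, t_values in target.items():
            score = jaccard(s_values, t_values)
            if score > best_score:
                best_name, best_score = t_name, score
        if best_name is not None and best_score >= threshold:
            matches[s_name] = (best_name, best_score)
    return matches


if __name__ == "__main__":
    # Hypothetical toy catalogues; the property names differ but the values overlap.
    catalogue_a = {
        "title": ["Dune", "Emma", "Ulysses"],
        "year": ["1965", "1815", "1922"],
    }
    catalogue_b = {
        "name": ["Dune", "Emma", "Hamlet"],
        "published": ["1965", "1922", "1603"],
    }
    print(match_properties(catalogue_a, catalogue_b))
```

In this toy example `title` is paired with `name` and `year` with `published`, because the matched properties share half of their combined values. A simple set overlap of this kind depends on the two sources describing some of the same objects; a realistic matcher would also need to handle value normalization and ties between candidates.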
