PORSCHE: Performance ORiented SCHEma mediation

Semantic matching of schemas in heterogeneous data sharing systems is time-consuming and error-prone. Existing mapping tools use semi-automatic techniques that map two schemas at a time, which does not scale to scenarios where data sharing involves a large number of data sources. We present a robust automatic method that discovers semantic schema matches in a large set of XML schemas, incrementally creates an integrated schema encompassing all schema trees, and defines mappings from the contributing schemas to the integrated schema. Our method, PORSCHE (Performance ORiented SCHEma mediation), takes a holistic approach: it first clusters nodes based on linguistic label similarity and then applies a tree mining technique that uses node ranks calculated during depth-first traversal. This minimises the target node search space and improves performance, making the technique suitable for large-scale data sharing. The PORSCHE framework is hybrid in nature and flexible enough to incorporate further matching techniques or algorithms. We report on experiments with up to 80 schemas containing 83,770 nodes; our prototype implementation takes 587 s on average to match and merge them, producing an integrated schema and mappings from all input schemas to it. We evaluate the matching quality of PORSCHE using precision, recall and F-measure on randomly selected pairs of schemas from the same domain. We also discuss the integrity of the mediated schema in terms of completeness and minimality measures.
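
To illustrate how ranks assigned during depth-first traversal can narrow the target node search space, the following is a minimal sketch, not the PORSCHE implementation. It assumes each schema node stores a pre-order rank and a scope (the rank of the last node in its subtree), so that a descendant test reduces to two integer comparisons. The names `Node`, `assign_ranks`, `is_descendant` and the example labels are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical schema tree node (illustrative, not from the paper)."""
    label: str
    children: list["Node"] = field(default_factory=list)
    rank: int = -1   # pre-order number assigned during depth-first traversal
    scope: int = -1  # rank of the last node inside this node's subtree


def assign_ranks(root: Node) -> None:
    """Number nodes in depth-first pre-order and record each subtree's scope."""
    counter = 0

    def visit(node: Node) -> None:
        nonlocal counter
        node.rank = counter
        counter += 1
        for child in node.children:
            visit(child)
        # After all descendants are numbered, the last rank used bounds the subtree.
        node.scope = counter - 1

    visit(root)


def is_descendant(candidate: Node, ancestor: Node) -> bool:
    """True if candidate lies strictly inside ancestor's subtree (assumed rank/scope test)."""
    return ancestor.rank < candidate.rank <= ancestor.scope


# Small example tree: book -> (title, author -> name, price)
name = Node("name")
author = Node("author", [name])
title = Node("title")
price = Node("price")
book = Node("book", [title, author, price])
assign_ranks(book)
```

With ranks in place, a matcher that has already mapped `author` only needs to consider target nodes whose rank falls inside the mapped node's scope when placing `name`, rather than scanning the whole tree. Whether PORSCHE uses exactly this rank/scope encoding is an assumption of this sketch.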
