Improving XML schema matching performance using Prüfer sequences

Schema matching is a critical step in discovering semantic correspondences among elements in many data-sharing applications. Most existing schema matching algorithms produce scores between schema elements and, as a result, discover only simple matches. Such results solve the problem only partially. Identifying and discovering complex matches is considered one of the biggest obstacles to completely solving the schema matching problem. Another obstacle is the scalability of matching algorithms to large numbers of large-scale schemas. To tackle these challenges, we propose a new XML schema matching framework based on Prüfer encoding. In particular, we develop and implement the XPruM system, which consists mainly of two parts: schema preparation and schema matching. First, we parse XML schemas and represent them internally as schema trees. Prüfer sequences are then constructed for each schema tree and used to build a sequence representation of the schemas. We capture the semantic information of a schema tree in Label Prüfer Sequences (LPS) and its structural information in Number Prüfer Sequences (NPS). Next, we develop a new structural matching algorithm that exploits both LPS and NPS. To cope with complex match discovery, we introduce the concept of compatible nodes: semantic correspondences are first identified across complex elements, and the matching process is then refined to identify correspondences among the simple elements inside each pair of compatible nodes. Our experimental results demonstrate the performance benefits of the XPruM system.
