QMatch - Using paths to match XML schemas

Integrating multiple heterogeneous data sources remains a critical problem in many application domains and a challenge for researchers worldwide. With the growing popularity of the XML model and the proliferation of XML documents online, automated matching of XML documents and databases has become a critical issue. In this paper, we present QMatch, a hybrid schema matching algorithm that provides a path-based framework for combining traditional structural and semantic information with the constraints inherent in XML documents, such as the order of XML elements, to improve the level of matching between two given XML schemas. QMatch is based on a quality-of-match metric, QoM, and a set of classifiers. Together they provide not only an effective basis for developing a new schema matching algorithm but also a useful tool for tuning existing schema matching algorithms to output at desired levels of matching. We show through a set of experiments the benefits of the path-based QMatch over existing structural, linguistic, and hybrid algorithms such as Cupid, and we provide an empirical measure of QMatch's accuracy in terms of the true matches the algorithm discovers.
