Element matching across data-oriented XML sources using a multi-strategy clustering model

We describe a family of heuristics-based clustering strategies to support the merging of XML data from multiple sources. As part of this research, we have developed a comprehensive classification of the schematic and semantic conflicts that can occur when reconciling related XML data from multiple sources. Because element clustering is compute-intensive, especially when comparing large numbers of data elements that exhibit great representational diversity, performance is a critical yet so far neglected aspect of the merging process. We have developed five heuristics for clustering data in a multi-dimensional metric space. Equivalence of data elements within the individual clusters is determined using several distance functions that calculate the semantic distances among the elements.

The research described in this article is conducted within the context of the Integration Wizard (IWIZ) project at the University of Florida. IWIZ enables users to access and retrieve information from multiple XML-based sources through a consistent, integrated view. The results of our qualitative analysis of the clustering heuristics have validated the feasibility of our approach as well as its superior performance when compared to other similarity search techniques.
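To make the idea of distance-based element clustering concrete, the following is a minimal sketch in Python. It is not the five heuristics or the distance functions developed in this work, which the abstract does not specify. The function names (`name_distance`, `value_distance`, `combined_distance`, `cluster`), the equal weights, and the threshold of 0.3 are all assumptions chosen for illustration, and string similarity from the standard library stands in for a real semantic distance.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher


def name_distance(a, b):
    """Hypothetical distance between element names (0 = identical, 1 = unrelated)."""
    return 1.0 - SequenceMatcher(None, a.tag.lower(), b.tag.lower()).ratio()


def value_distance(a, b):
    """Hypothetical distance between element text values."""
    ta = (a.text or "").strip().lower()
    tb = (b.text or "").strip().lower()
    return 1.0 - SequenceMatcher(None, ta, tb).ratio()


def combined_distance(a, b, w_name=0.5, w_value=0.5):
    """Weighted sum of the individual distances; the weights are illustrative assumptions."""
    return w_name * name_distance(a, b) + w_value * value_distance(a, b)


def cluster(elements, threshold=0.3):
    """Greedy single-pass clustering: an element joins the first cluster whose
    representative (its first member) lies within the threshold; otherwise it
    starts a new cluster."""
    clusters = []
    for el in elements:
        for members in clusters:
            if combined_distance(el, members[0]) <= threshold:
                members.append(el)
                break
        else:
            clusters.append([el])
    return clusters


if __name__ == "__main__":
    # Two toy sources that represent the same author with different tag casing.
    source_a = ET.fromstring(
        "<rec><author>J. Hammer</author><title>IWIZ</title></rec>"
    )
    source_b = ET.fromstring(
        "<rec><Author>J. Hammer</Author><price>42</price></rec>"
    )
    for members in cluster(list(source_a) + list(source_b)):
        print([(el.tag, el.text) for el in members])
```

Comparing each element only against one representative per cluster, rather than against every other element, is one simple way to reduce the number of distance computations, which is the kind of cost the clustering strategies are meant to control.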
