Semantics-Guided Clustering of Heterogeneous XML Schemas

In this paper we present an approach for clustering semantically heterogeneous XML Schemas. The approach is driven by the semantics of the Schemas involved, which is defined by the interschema properties holding among the concepts they represent. The interschema properties we consider are synonymies (two concepts have the same meaning), hyponymies (one concept has a more specific meaning than another), and overlappings (two concepts are neither synonyms nor one a hyponym of the other, but represent, to some extent, the same reality). An important feature of our approach is that it can be integrated with almost all the clustering algorithms already proposed in the literature. We present both a theoretical and an experimental analysis of the complexity of our approach. They show that the approach is scalable and particularly suited to application contexts characterized by a large number and a wide variety of XML Schemas.
