XML schema clustering with semantic and hierarchical similarity measures

With the growing popularity of XML as a data representation language, the number of XML data collections has exploded. Methods are required to manage these collections and to discover useful information from them for improved document handling. We present a schema clustering process that organises heterogeneous XML schemas into groups. The methodology considers not only the linguistic and contextual similarity of the elements but also their hierarchical structural similarity. We support our findings with experiments and analysis.
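The idea of combining element-level and hierarchical similarity before grouping schemas can be illustrated with a minimal sketch. It is not the method evaluated in this work. The flat `(name, depth)` schema representation, the string-based `name_similarity` (standing in for a real linguistic measure), the weight `w`, the `threshold`, the single-pass grouping, and the sample schemas are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Assumption: a schema is a flat list of (element_name, depth) pairs.

def name_similarity(a, b):
    """String-based stand-in for a linguistic measure (hypothetical choice)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def level_similarity(depth_a, depth_b):
    """1.0 for elements at the same depth, decaying as the depth gap grows."""
    return 1.0 / (1.0 + abs(depth_a - depth_b))

def element_similarity(e1, e2, w=0.5):
    """Weighted mix of name and hierarchy-level similarity (w is assumed)."""
    (n1, d1), (n2, d2) = e1, e2
    return w * name_similarity(n1, n2) + (1 - w) * level_similarity(d1, d2)

def schema_similarity(s1, s2, w=0.5):
    """Symmetric average of each element's best match in the other schema."""
    def one_way(a, b):
        return sum(max(element_similarity(x, y, w) for y in b) for x in a) / len(a)
    return (one_way(s1, s2) + one_way(s2, s1)) / 2

def cluster(schemas, threshold=0.8):
    """Single pass: join the first group whose first member is similar enough."""
    clusters = []
    for name, schema in schemas.items():
        for group in clusters:
            if schema_similarity(schema, schemas[group[0]]) >= threshold:
                group.append(name)
                break
        else:
            clusters.append([name])
    return clusters

# Hypothetical sample schemas.
schemas = {
    "books": [("library", 0), ("book", 1), ("title", 2), ("author", 2)],
    "books2": [("library", 0), ("book", 1), ("title", 2), ("writer", 2)],
    "orders": [("order", 0), ("customer", 1), ("item", 1), ("price", 2)],
}
groups = cluster(schemas)
print(groups)
```

Taking each element's best match in both directions keeps the schema score symmetric, so the grouping does not depend on which schema is treated as the source.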
