Fast and effective clustering of XML data using structural information

This paper presents an incremental clustering algorithm, XML documents Clustering with Level Similarity (XCLS), that groups XML documents according to their structural similarity. A level structure format is introduced to represent the structure of XML documents for efficient processing. A global criterion function is developed that measures the similarity between a new document and the existing clusters. It avoids the need to compute pair-wise similarity between individual documents and hence saves a substantial amount of computation. XCLS is further modified to incorporate the semantic meanings of XML tags in order to investigate the trade-off between accuracy and efficiency. The empirical analysis shows that structural similarity plays a larger role than semantic similarity in clustering structured data such as XML. The experimental analysis also shows that XCLS is fast and accurate in clustering heterogeneous documents by structure.
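
The idea of comparing each incoming document with cluster-level summaries, rather than with every earlier document, can be illustrated with a minimal Python sketch. It is not the XCLS criterion function: the abstract does not give the formula, so the `level_similarity` score below (per-level tag overlap with a 1/(depth+1) weight), the `merge_levels` rule, and the `threshold` parameter are all assumptions made for illustration. Unlike a full level similarity, this simplified score only matches tags that sit at the same depth.

```python
import xml.etree.ElementTree as ET


def level_structure(xml_text):
    """Return a list whose i-th entry is the set of tag names at depth i."""
    root = ET.fromstring(xml_text)
    levels = []
    frontier = [root]
    while frontier:
        levels.append({node.tag for node in frontier})
        # Move one level down: collect the children of every node in the frontier.
        frontier = [child for node in frontier for child in node]
    return levels


def level_similarity(doc_levels, cluster_levels):
    """Illustrative score in [0, 1] (an assumption, not the paper's formula).

    Each level contributes its tag overlap (intersection over union),
    weighted by 1 / (depth + 1) so that shallow levels count more.
    """
    depth = max(len(doc_levels), len(cluster_levels))
    score = 0.0
    total_weight = 0.0
    for i in range(depth):
        weight = 1.0 / (i + 1)
        a = doc_levels[i] if i < len(doc_levels) else set()
        b = cluster_levels[i] if i < len(cluster_levels) else set()
        union = a | b
        if union:
            score += weight * len(a & b) / len(union)
        total_weight += weight
    return score / total_weight if total_weight else 0.0


def merge_levels(cluster_levels, doc_levels):
    """Fold a document into a cluster summary by taking the per-level union of tags."""
    depth = max(len(cluster_levels), len(doc_levels))
    merged = []
    for i in range(depth):
        a = cluster_levels[i] if i < len(cluster_levels) else set()
        b = doc_levels[i] if i < len(doc_levels) else set()
        merged.append(a | b)
    return merged


def cluster_incrementally(docs, threshold=0.5):
    """Assign each document to the most similar cluster, or open a new one.

    Each document is compared once with every cluster summary, never with
    individual earlier documents. `threshold` is a hypothetical parameter.
    """
    clusters = []  # each cluster: {"levels": [...], "members": [doc indices]}
    for idx, xml_text in enumerate(docs):
        levels = level_structure(xml_text)
        best, best_score = None, 0.0
        for cluster in clusters:
            s = level_similarity(levels, cluster["levels"])
            if s > best_score:
                best, best_score = cluster, s
        if best is not None and best_score >= threshold:
            best["levels"] = merge_levels(best["levels"], levels)
            best["members"].append(idx)
        else:
            clusters.append({"levels": levels, "members": [idx]})
    return clusters
```

With this sketch, two `book` documents that differ by a single child element end up in one cluster, while a structurally unrelated `order` document opens a second cluster. The cost per incoming document grows with the number of clusters rather than with the number of documents seen so far, which is the saving the abstract attributes to avoiding pair-wise comparison.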
