A weighted common structure based clustering technique for XML documents

XML has become widely used both for representing semistructured data and as a standard for data exchange over the Web, owing to its applicability across many domains. XML documents therefore constitute an important data mining domain. In this paper, we propose a new method for clustering XML documents that uses a global criterion function based on the weight of common structures. Our approach first extracts representative structures, in the form of frequent patterns, from schemaless XML documents using a sequential pattern mining algorithm. It then treats each XML document as a transaction and the frequent structures extracted from the documents as the items of that transaction, and clusters the documents by the weight of their common structures, without computing a pairwise similarity measure. We conducted experiments comparing our method with previous methods, and the results show the effectiveness of our approach.
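The transaction view described above can be illustrated with a minimal sketch. It is not the method of the paper: the abstract does not give the weighted criterion function, so the sketch substitutes a CLOPE-style profit as a stand-in global criterion, and it replaces sequential pattern mining with a simple support count over root-to-node tag paths. All function names, the `min_support` threshold, and the repulsion parameter `r` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_paths(xml_text):
    """Return the set of root-to-node tag paths of one XML document."""
    root = ET.fromstring(xml_text)
    paths = set()
    stack = [(root, ())]
    while stack:
        node, prefix = stack.pop()
        path = prefix + (node.tag,)
        paths.add(path)
        stack.extend((child, path) for child in node)
    return paths


def to_transactions(docs, min_support):
    """Turn each document into a transaction of frequent tag paths.

    Assumption: plain support counting stands in for the sequential
    pattern mining step mentioned in the abstract.
    """
    all_paths = [tag_paths(d) for d in docs]
    support = Counter(p for paths in all_paths for p in paths)
    frequent = {p for p, c in support.items() if c >= min_support}
    return [paths & frequent for paths in all_paths]


def profit(clusters, r):
    """CLOPE-style global criterion (a stand-in, not the paper's function).

    Each cluster is (item_counts, n_docs). A cluster scores higher when
    its documents share many items relative to its distinct item count.
    """
    total_docs = sum(n for _, n in clusters)
    score = 0.0
    for counts, n in clusters:
        width = len(counts)
        if width:  # clusters with no frequent items contribute nothing
            score += sum(counts.values()) / width ** r * n
    return score / total_docs if total_docs else 0.0


def cluster(transactions, r=2.0):
    """Single greedy pass: put each transaction where the profit is highest."""
    clusters = []  # list of (Counter of item occurrences, document count)
    labels = []
    for items in transactions:
        best_k, best_score = 0, float("-inf")
        # Try every existing cluster, then a brand-new one (k == len).
        for k in range(len(clusters) + 1):
            trial = list(clusters)
            if k < len(clusters):
                counts, n = trial[k]
                trial[k] = (counts + Counter(items), n + 1)
            else:
                trial.append((Counter(items), 1))
            score = profit(trial, r)
            if score > best_score:
                best_k, best_score = k, score
        if best_k == len(clusters):
            clusters.append((Counter(items), 1))
        else:
            counts, n = clusters[best_k]
            clusters[best_k] = (counts + Counter(items), n + 1)
        labels.append(best_k)
    return labels
```

Calling `cluster(to_transactions(docs, min_support=2))` on a list of XML strings returns one cluster label per document. No pairwise document similarity is computed at any point; each assignment is decided only by its effect on the global criterion, which is the property the abstract emphasizes.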
