XML Document Clustering Using Common XPath

XML is becoming a common way of storing data. The elements of an XML document and their arrangement in its hierarchy not only describe the document’s structure but also imply the semantic meaning of the data, and hence provide valuable information for developing tools that manipulate XML documents. In this paper, we pursue a data mining approach to the problem of XML document clustering. We introduce a novel XML structural representation called common XPath (CXP), which encodes the frequently occurring elements together with their hierarchical information, and we propose using the mined CXPs to form the feature vectors for XML document clustering. In other words, data mining acts as a feature extractor in the clustering process. Based on this idea, we devise a path-based XML document clustering algorithm called PBClustering, which groups documents according to their CXPs, i.e., their frequent structures. Encouraging simulation results are observed and reported.
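The idea of using mined frequent paths as clustering features can be illustrated with a minimal sketch. The abstract does not give the CXP definition or the PBClustering procedure, so everything below is an assumption made for illustration: a "path" is taken to be a plain root-to-element tag path, "frequent" means appearing in at least `min_support` documents, similarity is Jaccard over binary feature vectors, and the grouping is a simple greedy pass. The function names (`element_paths`, `frequent_paths`, `feature_vectors`, `group`) and the sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def element_paths(xml_text):
    """Return the set of root-to-element tag paths found in one document."""
    root = ET.fromstring(xml_text)
    paths = set()
    stack = [(root, "/" + root.tag)]
    while stack:
        node, path = stack.pop()
        paths.add(path)
        for child in node:
            stack.append((child, path + "/" + child.tag))
    return paths


def frequent_paths(docs_paths, min_support):
    """Keep paths occurring in at least min_support documents (assumed criterion)."""
    counts = Counter(p for paths in docs_paths for p in paths)
    return sorted(p for p, c in counts.items() if c >= min_support)


def feature_vectors(docs_paths, features):
    """Encode each document as a binary vector over the frequent paths."""
    return [[1 if f in paths else 0 for f in features] for paths in docs_paths]


def jaccard(u, v):
    """Jaccard similarity of two binary vectors."""
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    return both / either if either else 0.0


def group(vectors, threshold):
    """Greedy grouping: join the first cluster whose first member is similar enough."""
    clusters = []
    for i, v in enumerate(vectors):
        for cluster in clusters:
            if jaccard(vectors[cluster[0]], v) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Hypothetical sample collection: two "book" documents, two "order" documents.
docs = [
    "<book><title/><author><name/></author></book>",
    "<book><title/><author><name/></author><year/></book>",
    "<order><item><price/></item></order>",
    "<order><item><price/></item><customer/></order>",
]
docs_paths = [element_paths(d) for d in docs]
features = frequent_paths(docs_paths, min_support=2)
vectors = feature_vectors(docs_paths, features)
clusters = group(vectors, threshold=0.5)
print(clusters)  # [[0, 1], [2, 3]]
```

In this toy collection the paths `/book/year` and `/order/customer` each occur in only one document, so they fall below the support threshold and are excluded from the feature set; the remaining shared paths separate the two document families.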
