XML clustering by principal component analysis

XML is increasingly important in data exchange and information management. Considerable effort has been spent on developing efficient techniques for storing, querying, indexing and accessing XML documents. In this work we propose a new approach to clustering XML data. In contrast to previous work, which focused on documents defined by different DTDs, the proposed method works for documents with the same DTD. Our approach is to extract features from documents, modeled as ordered labeled trees, and to transform the documents into vectors in a high-dimensional Euclidean space based on the occurrences of the features in the documents. We then reduce the dimensionality of the vectors by principal component analysis (PCA) and cluster the vectors in the reduced-dimensional space. PCA makes it possible to identify vectors with co-occurring features, thereby enhancing the accuracy of the clustering. Experimental results based on documents obtained from Wisconsin's XML data bank show the effectiveness and good performance of the proposed techniques.
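
The pipeline described above (feature extraction, occurrence vectors, PCA, clustering in the reduced space) can be illustrated with a minimal sketch. It rests on several assumptions that do not come from the abstract: the features are taken to be parent/child tag pairs, only the first principal component is computed (by power iteration, standing in for a full PCA), and the clustering step is a simple split on the sign of the projection. All function names and the sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def edge_features(xml_text):
    """Count parent/child tag pairs in one document (an assumed feature choice)."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for parent in root.iter():
        for child in parent:
            counts[(parent.tag, child.tag)] += 1
    return counts


def to_vectors(docs):
    """Map each document to a vector of feature occurrence counts."""
    counted = [edge_features(d) for d in docs]
    vocab = sorted({f for c in counted for f in c})
    return [[float(c[f]) for f in vocab] for c in counted], vocab


def first_principal_component(vectors, iters=200):
    """Center the vectors and estimate the top principal direction by power iteration."""
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(d)]
    centered = [[v[j] - mean[j] for j in range(d)] for v in vectors]
    w = [1.0 + j for j in range(d)]  # fixed, deterministic start vector
    for _ in range(iters):
        # Apply X^T X to w, where X is the centered data matrix.
        scores = [sum(x[j] * w[j] for j in range(d)) for x in centered]
        w_new = [sum(scores[i] * centered[i][j] for i in range(n)) for j in range(d)]
        norm = sum(a * a for a in w_new) ** 0.5
        if norm == 0:
            break
        w = [a / norm for a in w_new]
    return centered, w


def cluster_documents(docs):
    """Project onto the first component and split by sign (a stand-in for real clustering)."""
    vectors, _ = to_vectors(docs)
    centered, w = first_principal_component(vectors)
    scores = [sum(x[j] * w[j] for j in range(len(w))) for x in centered]
    return [1 if s > 0 else 0 for s in scores]


# Hypothetical documents sharing one DTD: a <lib> holding books and/or articles.
SAMPLE_DOCS = [
    "<lib><book><title/><author/></book><book><title/><author/></book></lib>",
    "<lib><book><title/><author/></book><book><title/></book></lib>",
    "<lib><article><title/><journal/></article><article><title/><journal/></article></lib>",
    "<lib><article><title/><journal/></article></lib>",
]

if __name__ == "__main__":
    print(cluster_documents(SAMPLE_DOCS))
```

In this toy setting the book-related features co-occur, as do the article-related ones, so the first component contrasts the two groups. Which group receives label 0 or 1 depends on the sign of the estimated direction and carries no meaning.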
