A Flexible Structured-Based Representation for XML Document Mining

This paper reports on the INRIA group's approach to XML mining as a participant in the 2005 INEX XML Mining track. We use a flexible representation of XML documents that can take into account either the structure alone or both the structure and the content. Our approach represents each XML document by a set of its sub-paths, selected according to a few criteria (length, starting at the root, ending at a leaf). By treating those sub-paths as words, we can apply standard methods for vocabulary reduction and simple clustering methods such as k-means. We use an implementation of the clustering algorithm known as dynamic clouds that can work with distinct groups of independent modalities placed in separate variables. This is useful in our model because embedded sub-paths are not independent: we split potentially dependent paths into separate variables, so that each variable contains only independent paths. Experiments with the INEX collections show good results for the structure-only collections, but our approach did not scale well to the large structure-and-content collections.
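The sub-path representation described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`root_to_leaf_paths`, `sub_paths`, `group_by_length`) and the parameters (`min_len`, `max_len`, `from_root`, `to_leaf`) are hypothetical, chosen only to mirror the three criteria named in the abstract. The sketch handles structure only (text content is ignored), and grouping by length is an assumed way of building variables whose paths are not embedded in one another, since a path cannot be a proper sub-path of another path of the same length.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def root_to_leaf_paths(elem, prefix=()):
    """Yield every root-to-leaf tag path as a tuple of tag names."""
    path = prefix + (elem.tag,)
    children = list(elem)
    if not children:  # a leaf here is an element with no child elements
        yield path
    for child in children:
        yield from root_to_leaf_paths(child, path)


def sub_paths(xml_text, min_len=1, max_len=3, from_root=False, to_leaf=False):
    """Return the set of contiguous tag sub-paths that satisfy the criteria.

    min_len / max_len bound the number of tags in a sub-path;
    from_root keeps only sub-paths that start at the document root;
    to_leaf keeps only sub-paths that end at a leaf element.
    """
    root = ET.fromstring(xml_text)
    found = set()
    for path in root_to_leaf_paths(root):
        n = len(path)
        for start in range(n):
            if from_root and start != 0:
                break
            for end in range(start + min_len, min(start + max_len, n) + 1):
                if to_leaf and end != n:
                    continue
                found.add("/".join(path[start:end]))
    return found


def group_by_length(paths):
    """Assumed grouping: one variable per sub-path length."""
    groups = defaultdict(set)
    for p in paths:
        groups[p.count("/") + 1].add(p)
    return dict(groups)


doc = "<article><title>t</title><body><sec><p>x</p></sec></body></article>"
words = sub_paths(doc, min_len=1, max_len=3, from_root=True)
variables = group_by_length(words)
```

Each sub-path string then plays the role of a word, so a document becomes a bag of such words per variable, to which vocabulary reduction and k-means-style clustering can be applied.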
