Classification of XML Documents Based on a Linear Representation of Their Trees

In this work, we propose a new document representation for clustering collections of semi-structured documents. Our approach represents XML documents by their sub-paths, which are selected according to criteria such as length, starting at the root, or ending at a leaf, and which use either the structure alone or both the structure and the content. By treating these sub-paths as words, we can apply standard vocabulary-reduction methods and simple clustering methods such as K-means that scale well. In practice, we use an implementation of the clustering algorithm known as dynamic clouds, which can work with distinct groups of independent variables. This is necessary in our model because embedded sub-paths are not independent. We validate and evaluate the method on two collections, the INEX corpus and the INRIA activity reports, using a set of metrics well known in Information Retrieval.
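The idea of treating sub-paths as words can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `sub_path_counts`, the `min_len`/`max_len` bounds, and the `/` separator are assumptions made for illustration, and only the structure (tag names) is used, not the content. Each occurrence of a sub-path is counted once, at the element where it ends.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def sub_path_counts(xml_text, min_len=1, max_len=3):
    """Count tag sub-paths of an XML document (structure only).

    Hypothetical helper: a sub-path is a run of consecutive tags along
    a root-to-node path, with length between min_len and max_len.
    """
    root = ET.fromstring(xml_text)
    counts = Counter()

    def visit(elem, ancestors):
        path = ancestors + (elem.tag,)
        # Every sub-path ending at this element is a suffix of `path`.
        for length in range(min_len, min(max_len, len(path)) + 1):
            counts["/".join(path[-length:])] += 1
        for child in elem:
            visit(child, path)

    visit(root, ())
    return counts


doc = "<article><title>t</title><sec><p>a</p><p>b</p></sec></article>"
for sub_path, n in sorted(sub_path_counts(doc).items()):
    print(sub_path, n)
```

The resulting counts form a bag of "words" per document, which could then be fed to a vocabulary-reduction step and a clustering method. Note that nested sub-paths such as `p`, `sec/p`, and `article/sec/p` always occur together, which is the dependence between embedded sub-paths that the abstract mentions.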
