Paper information - Classification de documents XML à partir d'une représentation linéaire des arbres de ces documents

In this work, we propose a new document representation for clustering collections of semi-structured documents. Our approach represents XML documents by their sub-paths, selected according to criteria such as length, starting at the root, or ending at a leaf, using either the structure only or both the structure and the content. By treating these sub-paths as words, we can apply standard vocabulary-reduction methods and simple clustering methods such as K-means that scale well. In practice, we use an implementation of the clustering algorithm known as dynamic clouds, which can work with distinct groups of independent variables. This is necessary in our model because embedded sub-paths are not independent. To validate and evaluate our method, we use two collections, the INEX corpus and the INRIA activity reports, together with a set of metrics well known in Information Retrieval.
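The sub-path representation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`root_to_leaf_paths`, `sub_paths`, `path_bag`) and the parameters `min_len`, `max_len`, `from_root`, and `to_leaf` are hypothetical, chosen only to mirror the three criteria named above (length, root beginning, leaf ending). The sketch uses the structure only and ignores text content.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def root_to_leaf_paths(element, prefix=()):
    """Yield every root-to-leaf tag path of the tree as a tuple of tag names."""
    path = prefix + (element.tag,)
    children = list(element)
    if not children:
        yield path
    for child in children:
        yield from root_to_leaf_paths(child, path)


def sub_paths(path, min_len=1, max_len=None, from_root=False, to_leaf=False):
    """Yield contiguous sub-paths of one root-to-leaf path as '/'-joined tokens.

    The filters are illustrative assumptions, not the paper's exact criteria.
    """
    n = len(path)
    max_len = n if max_len is None else min(max_len, n)
    for length in range(min_len, max_len + 1):
        for start in range(n - length + 1):
            if from_root and start != 0:
                continue  # keep only sub-paths that begin at the root
            if to_leaf and start + length != n:
                continue  # keep only sub-paths that end at a leaf
            yield "/".join(path[start:start + length])


def path_bag(xml_text, **criteria):
    """Count sub-path tokens for one document (structure only).

    A sub-path shared by several root-to-leaf paths is counted once per
    path; this is a simplification made for the sketch.
    """
    root = ET.fromstring(xml_text)
    bag = Counter()
    for path in root_to_leaf_paths(root):
        bag.update(sub_paths(path, **criteria))
    return bag
```

Each resulting `Counter` plays the role of a bag of words, so one document becomes one sparse vector over sub-path tokens that a standard clustering method such as K-means could consume after vocabulary reduction. Because a token such as `article/body` is embedded in `article/body/sec`, the resulting variables are correlated, which is the dependence between embedded sub-paths that the abstract points out.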

Yves Lechevallier | Anne-Marie Vercoustre | Thierry Despeyroux | Mounir Fegas
