Paper information - Experiments in Clustering Homogeneous XML Documents to Validate an Existing Typology

This paper presents some experiments in clustering homogeneous XML documents to validate an existing classification or more generally an organisational structure. Our approach integrates techniques for extracting knowledge from documents with unsupervised classification (clustering) of documents. We focus on the feature selection used for representing documents and its impact on the emerging classification. We mix the selection of structured features with fine textual selection based on syntactic characteristics. We illustrate and evaluate this approach with a collection of Inria activity reports for the year 2003. The objective is to cluster projects into larger groups (Themes), based on the keywords or different chapters of these activity reports. We then compare the results of clustering using different feature selections, with the official theme structure used by Inria.
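The pipeline described in the abstract (select a structural feature from each XML report, cluster the projects, then compare the clusters with the official themes) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy reports, the element name `keywords`, the project and theme names, the Jaccard similarity, the single-link agglomerative clustering, and the purity score are not taken from the paper, whose actual clustering method and syntactic text selection are not reproduced here.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical toy data: project name -> (official theme, XML activity report).
# Element names, project names and themes are invented for this sketch.
REPORTS = {
    "projA": ("T1", "<report><keywords>xml clustering documents</keywords></report>"),
    "projB": ("T1", "<report><keywords>xml documents mining</keywords></report>"),
    "projC": ("T2", "<report><keywords>robotics vision control</keywords></report>"),
    "projD": ("T2", "<report><keywords>vision control sensors</keywords></report>"),
}


def keyword_features(xml_text, tag="keywords"):
    """Structural feature selection: keep only the words found under `tag`."""
    root = ET.fromstring(xml_text)
    words = set()
    for node in root.iter(tag):
        words.update((node.text or "").lower().split())
    return words


def jaccard(a, b):
    """Jaccard similarity between two word sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(features, k):
    """Single-link agglomerative clustering: merge the closest pair until k remain."""
    clusters = [[name] for name in sorted(features)]
    while len(clusters) > k:
        best = None  # (similarity, i, j) of the best pair seen so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = max(
                    jaccard(features[a], features[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        merged = clusters.pop(j)  # j > i, so index i is unaffected
        clusters[i].extend(merged)
    return clusters


def purity(clusters, themes):
    """Fraction of projects that fall in their cluster's majority theme."""
    agree = sum(
        Counter(themes[name] for name in group).most_common(1)[0][1]
        for group in clusters
    )
    return agree / sum(len(group) for group in clusters)


features = {name: keyword_features(xml) for name, (_, xml) in REPORTS.items()}
themes = {name: theme for name, (theme, _) in REPORTS.items()}
clusters = cluster(features, k=2)
print(clusters, purity(clusters, themes))
```

Swapping the `tag` argument for another element name would be one way to compare feature selections, in the spirit of the comparison the abstract describes; the similarity measure and the agreement score are likewise placeholders for whatever a real study would choose.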
