Unsupervised Classification of Text-Centric XML Document Collections

This paper addresses the unsupervised classification of text-centric XML documents. In the context of the INEX 2006 XML mining track, we present methods that exploit the inherent structural information of XML documents in the clustering process. Using the k-means algorithm, we experimented with a couple of feature sets and found that a promising direction is to use structural information as a preliminary step to detect and set aside structural outliers. This approach improves the semantic quality of the clustering significantly more than combining the structural and textual feature sets does.
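The two-stage idea described above can be illustrated with a minimal sketch. It is not the implementation used in the paper, and everything specific in it is an assumption made for illustration: element-tag counts stand in for the structural features, a bag of words for the textual features, and a document is treated as a structural outlier when its tag profile lies more than a chosen number of standard deviations (the hypothetical `threshold` parameter) further from the mean profile than average. All function names are hypothetical.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def tag_counts(xml_string):
    """Structural features (assumed): how often each element tag occurs."""
    return Counter(el.tag for el in ET.fromstring(xml_string).iter())


def word_counts(xml_string):
    """Textual features (assumed): lowercase bag of words over all text nodes."""
    text = " ".join(ET.fromstring(xml_string).itertext())
    return Counter(text.lower().split())


def to_vectors(counters):
    """Turn sparse counters into dense vectors over a shared, sorted vocabulary."""
    vocab = sorted(set().union(*counters))
    return [[c[key] for key in vocab] for c in counters]


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def kmeans(vectors, k, iterations=20):
    """Plain k-means, seeded deterministically with the first k vectors.

    Assumes len(vectors) >= k.
    """
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        labels = [min(range(k), key=lambda j: distance(v, centroids[j]))
                  for v in vectors]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = mean(members)
    return labels


def structural_outliers(xml_docs, threshold=1.5):
    """Indices of documents whose tag profile is unusually far from the mean profile."""
    vectors = to_vectors([tag_counts(d) for d in xml_docs])
    center = mean(vectors)
    dists = [distance(v, center) for v in vectors]
    avg = sum(dists) / len(dists)
    std = math.sqrt(sum((d - avg) ** 2 for d in dists) / len(dists))
    return {i for i, d in enumerate(dists) if d > avg + threshold * std}


def cluster_without_outliers(xml_docs, k):
    """Set structural outliers aside, then run k-means on textual features only."""
    outliers = structural_outliers(xml_docs)
    kept = [i for i in range(len(xml_docs)) if i not in outliers]
    vectors = to_vectors([word_counts(xml_docs[i]) for i in kept])
    labels = kmeans(vectors, k)
    return dict(zip(kept, labels)), sorted(outliers)
```

Calling `cluster_without_outliers(docs, k)` on a list of XML strings returns a mapping from the index of each retained document to its cluster label, together with the indices that were set aside. The distance-from-mean rule is only one simple way to flag outliers; any other structural criterion could be substituted without changing the second stage.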
