Paper information - UJM at INEX 2008 XML Mining Track

This paper reports our experiments for the INEX XML Mining track, which consists of developing categorization (or classification) and clustering methods for XML documents. We represent XML documents as vectors of indexed terms. For this first participation, the purpose of our experiments is twofold. First, our overall aim is to set up a text-only categorization approach that can serve as a baseline for further work that will take into account the structure of the XML documents. Second, our goal is to define two criteria (CC and CCE), based on term distribution, for reducing the size of the index. The results of our baseline are good, and with our two criteria we improve these results while slightly reducing the size of the index. The results are slightly worse when we sharply reduce the size of the index.
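
To make the text-only representation concrete, here is a minimal Python sketch of turning XML documents into term-count vectors and then shrinking the index. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not define the CC and CCE criteria, so a plain document-frequency threshold (the hypothetical `min_df` parameter) stands in for them, and the helper names and the two-document corpus are invented for the example.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter


def text_of(xml_string):
    """Return the text content of an XML document, ignoring its structure."""
    root = ET.fromstring(xml_string)
    return " ".join(root.itertext())


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(docs, min_df=1):
    """Build a sorted term index from tokenized documents.

    Assumption: terms are kept when they occur in at least `min_df`
    documents. This is a stand-in for the paper's CC and CCE criteria,
    which are not defined in the abstract.
    """
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))
    return sorted(term for term, n in doc_freq.items() if n >= min_df)


def vectorize(tokens, index):
    """Represent one document as a vector of term counts over the index."""
    counts = Counter(tokens)
    return [counts[term] for term in index]


# Hypothetical two-document corpus.
corpus = [
    "<article><title>XML mining</title><body>mining XML documents</body></article>",
    "<article><title>Text categorization</title><body>categorization of documents</body></article>",
]
docs = [tokenize(text_of(x)) for x in corpus]

full_index = build_index(docs)               # every term seen in the corpus
reduced_index = build_index(docs, min_df=2)  # only terms shared by both documents
vectors = [vectorize(d, reduced_index) for d in docs]
```

The resulting vectors could then be passed to any standard classifier. The sketch only shows that reducing the index shortens every document vector, which is the trade-off the abstract reports between index size and categorization quality.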

Christophe Moulin | Mathias Géry | Christine Largeron
