Paper information - Semi-supervised Categorization of Wikipedia Collection by Label Expansion

Semi-supervised Categorization of Wikipedia Collection by Label Expansion

We address the problem of categorizing a large set of linked documents with important content and structure aspects, for example the Wikipedia collection proposed at the INEX XML Mining track. We cope with the case where there is a small number of labeled pages and a very large number of unlabeled ones. Due to the sparsity of the link-based structure of Wikipedia, we apply the spectral and graph-based techniques developed in semi-supervised machine learning. We use the content and structure views of the Wikipedia collection to build a transductive categorizer for the unlabeled pages. We report evaluation results obtained with the label propagation function, which ensures good scalability on sparse graphs.
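The abstract names a label propagation function but does not spell it out. As a rough illustration only, the sketch below shows one common iterative scheme from the graph-based semi-supervised literature: labeled nodes are clamped to their known class, and each unlabeled node repeatedly takes the weighted average of its neighbors' class scores. This may differ from the function used in the paper. The function name `propagate_labels`, the `(u, v, weight)` edge format, and the fixed iteration count are assumptions made for this example. Storing the graph as an adjacency list means each pass touches only existing links, which is where the scalability on sparse graphs comes from.

```python
from collections import defaultdict


def propagate_labels(edges, seed_labels, classes, n_iter=50):
    """Iterative label propagation on a sparse undirected graph.

    edges       -- iterable of (u, v, weight) tuples, weight > 0 (assumed format)
    seed_labels -- dict mapping the few labeled nodes to their class
    classes     -- list of all class names
    Returns a dict mapping every node to its predicted class.
    """
    # Adjacency list: only links that exist are stored.
    neighbors = defaultdict(list)
    for u, v, w in edges:
        neighbors[u].append((v, w))
        neighbors[v].append((u, w))

    nodes = set(neighbors) | set(seed_labels)
    n_classes = len(classes)

    # One score vector per node: one-hot for seeds, zeros for unlabeled nodes.
    scores = {node: [0.0] * n_classes for node in nodes}
    for node, label in seed_labels.items():
        scores[node] = [1.0 if c == label else 0.0 for c in classes]

    for _ in range(n_iter):
        updated = {}
        for node in nodes:
            if node in seed_labels:
                # Clamp labeled nodes to their known class.
                updated[node] = scores[node]
                continue
            nbrs = neighbors[node]
            total = sum(w for _, w in nbrs)
            # Weighted average of the neighbors' current scores.
            updated[node] = [
                sum(w * scores[nb][k] for nb, w in nbrs) / total
                for k in range(n_classes)
            ]
        scores = updated

    # Predict the class with the highest score (ties go to the earlier class).
    return {
        node: classes[max(range(n_classes), key=lambda k: vec[k])]
        for node, vec in scores.items()
    }


# Toy chain of five pages with one labeled page at each end.
toy_edges = [("a", "b", 1.0), ("b", "c", 1.0), ("c", "d", 1.0), ("d", "e", 1.0)]
toy_seeds = {"a": "sports", "e": "music"}
predicted = propagate_labels(toy_edges, toy_seeds, ["sports", "music"])
```

In the toy chain, page `b` ends up with the label of its labeled neighbor `a`, and page `d` with the label of `e`. The middle page `c` is equidistant from both seeds, so its label depends only on the tie-breaking rule.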

Boris Chidlovskii
