Paper information - Neighborhood Conscious Hypertext Categorization

Neighborhood Conscious Hypertext Categorization

Ralitsa Angelova. Master of Science, Department of Computer Science, Saarland University, 2004.

A fundamental issue in statistics, pattern recognition, and machine learning is that of classification. In a traditional classification problem, we wish to assign one of k labels (or classes) to each of n objects (or documents), in a way that is consistent with some observed data available about that problem. To achieve better classification results, we try to capture the information derived from pairwise relationships between objects, in particular hyperlinks between web documents. The use of hyperlinks poses new problems not addressed in the extensive text classification literature. Links contain high-quality semantic clues that a purely text-based classifier cannot take advantage of. However, exploiting link information is non-trivial because it is noisy, and a naive use of terms in the link neighborhood of a document can degrade accuracy. The problem becomes even harder when only a very small fraction of document labels are known to the classifier and can be used for training, as is the case in a real classification scenario. Our work is based on an algorithm proposed by Soumen Chakrabarti and uses the theory of Markov Random Fields to derive a relaxation labelling technique for the class assignment problem. We show that the extra information contained in the hyperlinks between the documents can be exploited to achieve significant improvement in classification performance. We implemented our algorithm in Java and ran our experiments on two sets of data obtained from the DBLP and IMDB databases. We observed up to a 5.5% improvement in classification accuracy and up to 10% higher recall and precision.
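The relaxation labelling idea mentioned in the abstract can be illustrated with a small sketch. This is not the thesis implementation (which was written in Java) and not Chakrabarti's exact algorithm; it is a simplified Python illustration under stated assumptions. The function name `relaxation_labeling`, the class-compatibility matrix `compat`, and the toy documents are all hypothetical. Each unlabeled document starts from the class probabilities of a text-only classifier, and those estimates are repeatedly re-weighted by how compatible each class is with the current soft labels of the linked neighbors. Documents with known labels stay fixed.

```python
from typing import Dict, List, Optional


def relaxation_labeling(
    text_probs: Dict[str, List[float]],
    neighbors: Dict[str, List[str]],
    compat: List[List[float]],
    known: Optional[Dict[str, int]] = None,
    iterations: int = 10,
) -> Dict[str, List[float]]:
    """Simplified relaxation labelling over a link graph (illustrative only).

    text_probs: per-document class probabilities from a text-only classifier.
    neighbors:  per-document list of linked documents.
    compat:     assumed k x k matrix; compat[c][d] scores class c for a
                document whose linked neighbor has class d.
    known:      documents with a known class index; these are clamped.
    """
    known = known or {}
    k = len(compat)

    def one_hot(c: int) -> List[float]:
        return [1.0 if i == c else 0.0 for i in range(k)]

    # Start from the text-only estimates; known labels become one-hot.
    probs = {
        doc: one_hot(known[doc]) if doc in known else list(p)
        for doc, p in text_probs.items()
    }

    for _ in range(iterations):
        updated = {}
        for doc, prior in text_probs.items():
            if doc in known:
                updated[doc] = probs[doc]
                continue
            scores = []
            for c in range(k):
                score = prior[c]
                for nb in neighbors.get(doc, []):
                    # Expected compatibility of class c with the neighbor's
                    # current soft label.
                    score *= sum(compat[c][d] * probs[nb][d] for d in range(k))
                scores.append(score)
            total = sum(scores)
            # Fall back to the text-only estimate if everything vanished.
            updated[doc] = [s / total for s in scores] if total > 0 else list(prior)
        probs = updated
    return probs


# Toy example: "a" has a known label, "b" is ambiguous from text alone but
# links to "a", and "c" has no links.
result = relaxation_labeling(
    text_probs={"a": [0.5, 0.5], "b": [0.5, 0.5], "c": [0.2, 0.8]},
    neighbors={"a": ["b"], "b": ["a"], "c": []},
    compat=[[0.9, 0.1], [0.1, 0.9]],
    known={"a": 0},
)
```

In this toy setup the link to the labeled document "a" pulls the ambiguous document "b" toward class 0, while the unlinked document "c" keeps its text-only estimate. A real system would have to estimate the compatibility values from training data and handle noisy links, which is the difficulty the abstract points to.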

Gerhard Weikum | Ralitsa Angelova

[1] Daphne Koller, et al. Toward Optimal Feature Selection, 1996, ICML.

[2] Yiming Yang, et al. A Comparative Study on Feature Selection in Text Categorization, 1997, ICML.

[3] Jugal K. Kalita, et al. Summarization as feature selection for text categorization, 2001, CIKM '01.

[4] Éva Tardos, et al. Approximation algorithms for classification problems with pairwise relationships: metric labeling and Markov random fields, 2002, JACM.

[5] Prabhakar Raghavan, et al. Using Taxonomy, Discriminants, and Signatures for Navigating in Text Databases, 1997, VLDB.

[6] Andrew McCallum, et al. A comparison of event models for naive bayes text classification, 1998, AAAI 1998.

[7] Lionel Pelkowitz, et al. A continuous relaxation labeling algorithm for Markov random fields, 1990, IEEE Trans. Syst. Man Cybern.

[8] David Eppstein, et al. Finding the k Shortest Paths, 1999, SIAM J. Comput.

[9] J. Besag. Spatial Interaction and the Statistical Analysis of Lattice Systems, 1974.

[10] Piotr Indyk, et al. Enhanced hypertext categorization using hyperlinks, 1998, SIGMOD '98.