Paper information - A probabilistic description-oriented approach for categorizing web documents

A probabilistic description-oriented approach for categorizing web documents

The automatic categorisation of web documents is becoming crucial for organising the huge amount of information available on the Internet. This poses a new challenge, because web documents have a rich structure and are highly heterogeneous. Two ways to respond to this challenge are (1) using a representation of the content of web documents that captures these two characteristics and (2) using more effective classifiers. Our categorisation approach is based on a probabilistic description-oriented representation of web documents and a probabilistic interpretation of the k-nearest neighbour classifier. With the former, we provide an enhanced document representation that incorporates the structural and heterogeneous nature of web documents. With the latter, we provide a theoretically sound justification for the various parameters of the k-nearest neighbour classifier. Experimental results show that (1) using an enhanced representation of web documents is crucial for effective categorisation, and (2) the probabilistic interpretation of the k-nearest neighbour classifier yields an improvement over the standard k-nearest neighbour classifier.
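
For readers unfamiliar with the baseline mentioned in the abstract, the following is a minimal sketch of a standard similarity-weighted k-nearest neighbour text categoriser. It is not the paper's probabilistic model or its description-oriented representation: the bag-of-words term-frequency vectors, the cosine similarity, and the names `tf_vector`, `cosine`, and `knn_scores` are all assumptions chosen for illustration.

```python
from collections import Counter, defaultdict
import math


def tf_vector(text):
    """Bag-of-words term-frequency vector (an illustrative, deliberately simple representation)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_scores(query, training, k=3):
    """Score each category by the summed similarity of the k nearest training documents.

    `training` is a list of (text, category) pairs. The scores are normalised
    to sum to 1, which is only a heuristic and not a calibrated probability.
    """
    q = tf_vector(query)
    neighbours = sorted(
        ((cosine(q, tf_vector(text)), label) for text, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    scores = defaultdict(float)
    for similarity, label in neighbours:
        scores[label] += similarity
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()} if total else {}
```

In this sketch the choice of `k`, the similarity function, and the way neighbour votes are combined are all ad hoc. These are the kinds of parameters for which the abstract says the probabilistic interpretation provides a justification.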
