News articles and Web directories represent some of the most popular and commonly accessed content on the Web. Information designers normally define categories that model these knowledge domains (i.e., news topics or Web categories), and domain experts assign documents to these categories. The paper describes how machine learning and automatic document classification techniques can be used to manage large numbers of news articles or Web page descriptions, lightening the load on domain experts. The paper uses two datasets, one with more than 800,000 Reuters news stories and another with over 41,000 Web sites, and classifies them into predefined categories using a Naive Bayes algorithm. We discuss the parameters and design decisions that normally arise when building automatic classifiers, including stemming, stop-word removal, thresholding, the amount of training data, and approaches for improving performance using the structure of XML documents. The methodology developed would enable Web-based applications or workflow systems to manage information more efficiently, i.e., by assigning documents to topics automatically or by assisting humans in the process of doing so.
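To make the design decisions named above concrete, the following is a minimal sketch of a multinomial Naive Bayes text classifier with stop-word removal and a score threshold. It is not the paper's implementation: the stop-word list, the Laplace smoothing parameter `alpha`, the fixed posterior cut-off `threshold`, the toy documents, and all class and function names are assumptions chosen for illustration. Stemming and the use of XML structure are omitted.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny illustrative stop-word list (an assumption, not the list used in the paper).
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "on", "for"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop-words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


class NaiveBayes:
    """Multinomial Naive Bayes with Laplace (add-alpha) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # smoothing strength (assumed value)
        self.doc_counts = Counter()              # documents seen per category
        self.term_counts = defaultdict(Counter)  # token counts per category
        self.vocab = set()

    def fit(self, docs, labels):
        for text, label in zip(docs, labels):
            self.doc_counts[label] += 1
            for tok in tokenize(text):
                self.term_counts[label][tok] += 1
                self.vocab.add(tok)
        return self

    def posteriors(self, text):
        """Return P(category | text) for every category seen in training."""
        tokens = [t for t in tokenize(text) if t in self.vocab]
        n_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        log_scores = {}
        for label, n in self.doc_counts.items():
            total = sum(self.term_counts[label].values())
            score = math.log(n / n_docs)  # log prior
            for tok in tokens:
                count = self.term_counts[label][tok]
                score += math.log((count + self.alpha) / (total + self.alpha * vocab_size))
            log_scores[label] = score
        # Normalise in log space to avoid underflow on long documents.
        top = max(log_scores.values())
        norm = sum(math.exp(s - top) for s in log_scores.values())
        return {label: math.exp(s - top) / norm for label, s in log_scores.items()}

    def classify(self, text, threshold=0.5):
        """Assign every category whose posterior reaches the threshold."""
        post = self.posteriors(text)
        return sorted(label for label, p in post.items() if p >= threshold)


# Toy training data (hypothetical, for illustration only).
docs = [
    "stocks fall as markets react to interest rates",
    "central bank raises interest rates",
    "team wins the football cup final",
    "striker scores in the cup final",
]
labels = ["markets", "markets", "sport", "sport"]

model = NaiveBayes().fit(docs, labels)
print(model.classify("bank raises rates"))
print(model.classify("bank raises rates", threshold=0.99))
```

Raising `threshold` trades recall for precision: a document whose best category falls below the cut-off is left unassigned, which is one way such a classifier could hand uncertain cases to a human expert rather than label them automatically.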