Web classification using support vector machine

In web classification, web pages from one or more web sites are assigned to pre-defined categories according to their content. Since web pages are more than plain text documents, web classification methods have to consider other context features of web pages, such as hyperlinks and HTML tags. In this paper, we propose the use of Support Vector Machine (SVM) classifiers to classify web pages using both their text and context feature sets. We have evaluated our web classification method on the WebKB data set. Compared with the earlier Foil-Pilfs method on the same data set, our method has been shown to perform very well. We have also shown that the use of context features, especially hyperlinks, can improve the classification performance significantly.
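To make the idea of combining text and context feature sets concrete, the following is a minimal sketch of how a single page might be turned into one feature vector. It is an illustration under stated assumptions, not the feature construction used in the paper: the choice of `<title>` and anchor text as context features, the `text:`/`title:`/`anchor:` prefixes, and the names `PageFeatures` and `extract_features` are all hypothetical. The resulting dictionary could then be vectorized and passed to any SVM implementation.

```python
from collections import Counter
from html.parser import HTMLParser
import re

TOKEN = re.compile(r"[a-z0-9]+")


class PageFeatures(HTMLParser):
    """Counts tokens of a page as text features and, in addition,
    as context features when they occur inside <title> or <a>.
    The feature names and the choice of tags are assumptions."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()
        self._open = []  # currently open <title>/<a> tags

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "a"):
            self._open.append(tag)

    def handle_endtag(self, tag):
        if tag in self._open:
            # Drop the most recently opened matching tag.
            idx = len(self._open) - 1 - self._open[::-1].index(tag)
            del self._open[idx]

    def handle_data(self, data):
        for tok in TOKEN.findall(data.lower()):
            # Every token contributes to the plain text feature set.
            self.counts["text:" + tok] += 1
            # Tokens inside a context tag also get a prefixed feature,
            # so the classifier can weight them separately.
            for tag in set(self._open):
                prefix = "title:" if tag == "title" else "anchor:"
                self.counts[prefix + tok] += 1


def extract_features(html):
    """Return a sparse feature dict combining text and context features."""
    parser = PageFeatures()
    parser.feed(html)
    parser.close()
    return dict(parser.counts)


if __name__ == "__main__":
    page = (
        "<html><head><title>Course Page</title></head>"
        "<body><p>Welcome to the course.</p>"
        '<a href="x.html">Professor Smith</a></body></html>'
    )
    for name, count in sorted(extract_features(page).items()):
        print(name, count)
```

Keeping the context features in separately prefixed dimensions, rather than merging them into the text counts, lets a linear classifier learn different weights for the same word depending on where it appears.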
