Automatic Web Page Classification

The aim of this paper is to describe a method for automatically classifying web pages into semantic domains, together with its evaluation. The classification method exploits machine learning algorithms and several morphological and semantic text-processing tools. In contrast to general text document classification, web document classification often has to deal with short web pages that carry little text. In this paper we propose two approaches to compensate for this lack of information. In the first, we consider the wider context of a web page; that is, we analyze the web pages referenced from the investigated page. The second approach is based on sophisticated clustering of terms by their similar grammatical contexts, performed using the Sketch Engine, a statistical corpus tool.
