A Web Site Classification Approach Based On Its Topological Structure

Automatic web site classification has broad application prospects, yet it has received little research attention. Unlike plain text, a web site is a collection of many web pages connected by hyperlinks, so text classification methods cannot be applied to it directly. This paper proposes a web site classification approach based on the site's topological structure. Given a web site, we first represent its topological structure as a directed graph, from which we extract a strongly connected sub-graph that includes the site’s home page. Second, we apply an improved PageRank algorithm to the sub-graph to select topic-relevant resources, and represent them as a topic vector of the site. Finally, we use an SVM classifier to classify the site in terms of its topic vector. Experiments on web site classification show that our approach achieves better performance than the traditional super-page-based web site classification approach.
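
The first two steps of the pipeline can be illustrated with a minimal sketch, under stated assumptions. The abstract does not specify the improved PageRank variant, so plain PageRank stands in for it here. The site graph, the function names (`home_component`, `pagerank`, `top_pages`) and the parameters (`damping`, `iterations`, `k`) are all hypothetical and chosen for illustration only. Building the topic vector from the selected pages and training the SVM classifier are omitted.

```python
from collections import deque


def reachable(graph, start):
    """Return the set of pages reachable from start by following links."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for nxt in graph.get(page, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def reverse(graph):
    """Return a copy of the graph with every link direction flipped."""
    rev = {}
    for src, targets in graph.items():
        rev.setdefault(src, set())
        for dst in targets:
            rev.setdefault(dst, set()).add(src)
    return rev


def home_component(graph, home):
    """Pages in the strongly connected component that contains home.

    A page belongs to it if home can reach the page and the page can
    reach home, i.e. forward and backward reachability intersect.
    """
    return reachable(graph, home) & reachable(reverse(graph), home)


def pagerank(graph, pages, damping=0.85, iterations=50):
    """Plain PageRank restricted to pages (assumption: not the paper's variant)."""
    pages = sorted(pages)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    # Keep only links that stay inside the selected sub-graph.
    out = {p: [q for q in graph.get(p, ()) if q in rank] for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            if out[p]:
                share = damping * rank[p] / len(out[p])
                for q in out[p]:
                    new[q] += share
            else:
                # A page with no in-component links spreads its rank evenly.
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank


def top_pages(rank, k):
    """Select the k highest-ranked pages, breaking ties by URL."""
    return sorted(rank, key=lambda p: (-rank[p], p))[:k]


# Hypothetical site: "/legal" links to home but is not linked from it,
# and "/file.pdf" is linked from the site but has no path back to home.
site = {
    "/": ["/products", "/about"],
    "/products": ["/", "/products/a"],
    "/products/a": ["/products"],
    "/about": ["/", "/file.pdf"],
    "/file.pdf": [],
    "/legal": ["/"],
}

component = home_component(site, "/")
rank = pagerank(site, component)
selected = top_pages(rank, k=2)
```

In this sketch the text of the pages in `selected` would then be turned into a feature vector (for example, term weights) and passed to an SVM classifier. How the paper actually builds the topic vector is not stated in the abstract.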