Topical Host Reputation for Lightweight URL Classification

Classifying URLs into topical categories is an important task in data mining and information filtering. In many applications the task must be performed with minimal information, which usually means just the URL itself. While using only the URL is surprisingly effective for some topics, there is still a substantial loss in accuracy compared with classification based on full web page content. In this work we hypothesize that the basic URL-based approach can be significantly improved by taking web-graph information into account, in particular by precomputing the topical likelihood of known hosts. Although graph mining is computationally intensive, it is performed off-line; the URL classification process simply consumes its results, which reduces to a table lookup. We demonstrate that the proposed approach not only greatly improves on using URL strings as the source of features but also outperforms techniques that use full HTML content. While the set of categories considered in our experiments is not exhaustive, the results are highly encouraging, and the proposed approach could be beneficial in a variety of other topic settings.
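
To make the classification-time lookup concrete, the following is a minimal sketch and not the implementation evaluated in this work. It assumes the off-line graph-mining stage has already produced a table mapping each known host to a topical likelihood. The table contents, the names HOST_TOPIC_LIKELIHOOD, host_of and topic_score, and the fallback value for unknown hosts are all hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical table, assumed to be produced off-line by graph mining:
# host -> estimated likelihood that pages on that host belong to the topic.
HOST_TOPIC_LIKELIHOOD = {
    "example-sports.com": 0.92,
    "example-news.org": 0.35,
}


def host_of(url: str) -> str:
    """Return the lower-cased host name of a URL, or "" if none is present."""
    return (urlsplit(url).hostname or "").lower()


def topic_score(url: str, fallback: float = 0.5) -> float:
    """Look up the precomputed topical likelihood for the URL's host.

    `fallback` is an assumed default for hosts missing from the table;
    a real system might instead fall back to a URL-string classifier.
    """
    return HOST_TOPIC_LIKELIHOOD.get(host_of(url), fallback)
```

Under these assumptions, classifying a URL costs one URL parse and one dictionary lookup, regardless of how expensive the off-line stage was.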