Web page classification using visual layout analysis

Automatic processing of Web documents is an important issue in the design of search engines, Web mining tools, and applications for Web information extraction. Most systems rely on simple text-based approaches that discard most of the information provided by the visual layout of the page. Only a few visual features, such as font face and size, are actually used to weight the importance of the words in the page. In this paper, we propose a hierarchical representation that includes the visual screen coordinates of every HTML object in the page. The visual layout allows us to identify common page components such as the header, the navigation bars, the left and right menus, the footer, and the informative parts of the page. The functional role of each object is recognized by a set of heuristic rules. Experimental results show that page areas are correctly classified in 73% of the cases. Identifying the different functional areas of a page makes it possible to define a more accurate representation of its text contents, in which the text features are split into different subsets according to the area they belong to. We show that this approach can improve classification accuracy for page topic categorization by more than 10% with respect to a flat “bag-of-words” representation.
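
The idea of splitting text features by functional area can be illustrated with a minimal sketch. It is not the rule set used in this work: the `PageObject` type, the function names, the area labels, and every position threshold below are assumptions chosen only for illustration. Each object carries its rendered bounding box, a few position rules assign it an area, and words are then counted separately per area instead of in a single flat bag.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class PageObject:
    """An HTML object with its rendered bounding box in pixels (hypothetical type)."""
    text: str
    x: float
    y: float
    width: float
    height: float


def classify_area(obj, page_width, page_height):
    """Assign a functional area using illustrative position thresholds."""
    right = obj.x + obj.width
    bottom = obj.y + obj.height
    # Wide objects near the top or bottom edge are treated as header or footer.
    if obj.width >= 0.8 * page_width:
        if bottom <= 0.15 * page_height:
            return "header"
        if obj.y >= 0.90 * page_height:
            return "footer"
    # Narrow objects hugging a side edge are treated as menus.
    if obj.width <= 0.25 * page_width:
        if right <= 0.25 * page_width:
            return "left_menu"
        if obj.x >= 0.75 * page_width:
            return "right_menu"
    # Everything else is assumed to be informative content.
    return "content"


def area_bag_of_words(objects, page_width, page_height):
    """Count words separately per area by prefixing each token with its area."""
    features = Counter()
    for obj in objects:
        area = classify_area(obj, page_width, page_height)
        for token in obj.text.lower().split():
            features[f"{area}:{token}"] += 1
    return features


# A made-up page layout, 1000 px wide and 2000 px tall.
page = [
    PageObject("Acme News", 0, 0, 1000, 100),
    PageObject("Home Sports Politics", 0, 120, 200, 600),
    PageObject("Sports team wins final", 220, 120, 760, 1500),
    PageObject("Copyright Acme", 0, 1900, 1000, 100),
]
features = area_bag_of_words(page, page_width=1000, page_height=2000)
```

In this toy page the word "sports" appears both in the left menu and in the main content. A flat bag-of-words would merge the two occurrences, whereas the area-prefixed features keep them apart, so a classifier can learn to weight a word in the informative part of the page differently from the same word in a navigation menu.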