Visual Adjacency Multigraphs – a Novel Approach for Web Page Classification

Standard techniques for web page classification usually take a simple text-based approach, in which most of the information provided by the visual layout of a page is discarded. In this work we propose a new classification approach based on visual layout analysis, conducted before standard classification techniques are applied. A page is represented as a hierarchical structure, the Visual Adjacency Multigraph, in which nodes represent simple HTML objects (text, images) while directed edges reflect the spatial relations ‘immediately before’, ‘immediately after’, ‘immediately left’ and ‘immediately right’ on the browser screen. Using the visual information contained in the multigraph, one can define heuristics for recognizing common page entities such as vertical and horizontal link lists, titles and subtitles, and paragraphs of text. Visual analysis yields a more accurate representation of the page contents, one that splits the text features into different subsets according to the groups they belong to. Finally, we introduce a classification system that takes the proposed layout analysis into account and clearly outperforms a standard bag-of-words approach.
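
To make the representation concrete, the following is a minimal illustrative sketch in Python, not the implementation evaluated in this work. It assumes each HTML object is available as a rendered bounding box, reads ‘before’ and ‘after’ as the vertical relations (above and below), and keeps only the single nearest neighbour in each direction. The names `Box`, `build_edges` and `vertical_link_lists`, the `"link"` label and the `min_len` threshold are all hypothetical choices made for the example.

```python
from dataclasses import dataclass

RELATIONS = ("before", "after", "left", "right")


@dataclass(frozen=True)
class Box:
    """A simple HTML object with its rendered bounding box (hypothetical model)."""
    name: str
    kind: str  # e.g. "text", "image", "link" (labels assumed for this sketch)
    x: float
    y: float
    w: float
    h: float


def _overlap(a0, a1, b0, b1):
    """True if the intervals [a0, a1] and [b0, b1] share a positive-length span."""
    return min(a1, b1) - max(a0, b0) > 0


def _gap(a, b, rel):
    """Distance from a to b if b lies on side `rel` of a, else None.

    Assumption: 'before'/'after' mean above/below on the screen.
    """
    if rel in ("before", "after"):
        if not _overlap(a.x, a.x + a.w, b.x, b.x + b.w):
            return None
        gap = b.y - (a.y + a.h) if rel == "after" else a.y - (b.y + b.h)
    else:
        if not _overlap(a.y, a.y + a.h, b.y, b.y + b.h):
            return None
        gap = b.x - (a.x + a.w) if rel == "right" else a.x - (b.x + b.w)
    return gap if gap >= 0 else None


def build_edges(boxes):
    """Return directed, labelled edges (src, relation, dst): dst is the
    nearest object lying immediately <relation> src."""
    edges = []
    for a in boxes:
        for rel in RELATIONS:
            best = None
            for b in boxes:
                if b is a:
                    continue
                gap = _gap(a, b, rel)
                if gap is not None and (best is None or gap < best[0]):
                    best = (gap, b)
            if best is not None:
                edges.append((a.name, rel, best[1].name))
    return edges


def vertical_link_lists(boxes, edges, min_len=3):
    """Heuristic: a chain of links joined by 'after' edges is a vertical link list."""
    kind = {b.name: b.kind for b in boxes}
    nxt = {s: d for s, rel, d in edges
           if rel == "after" and kind[s] == "link" and kind[d] == "link"}
    has_pred = set(nxt.values())
    lists = []
    for start in nxt:
        if start in has_pred:
            continue  # not the head of a chain
        chain = [start]
        while chain[-1] in nxt:
            chain.append(nxt[chain[-1]])
        if len(chain) >= min_len:
            lists.append(chain)
    return lists


# Toy page: a title, a three-item navigation column, and a body block.
boxes = [
    Box("title", "text", 0, 0, 400, 30),
    Box("nav1", "link", 0, 40, 100, 20),
    Box("nav2", "link", 0, 60, 100, 20),
    Box("nav3", "link", 0, 80, 100, 20),
    Box("body", "text", 120, 40, 280, 200),
]
edges = build_edges(boxes)
link_lists = vertical_link_lists(boxes, edges)
```

On this toy page the three navigation links form a single chain of ‘after’ edges and are grouped as one vertical link list. Text features drawn from such a group could then be kept in a subset separate from those of the body text, which is the kind of split described above.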