Paper information - Web page classification using visual layout analysis

Web page classification using visual layout analysis

Automatic processing of Web documents is an important issue in the design of search engines, Web mining tools, and applications for Web information extraction. Simple text-based approaches are typically used, in which most of the information provided by the visual layout of the page is discarded. Only a few visual features, such as the font face and size, are effectively used to weigh the importance of the words in the page. In this paper, we propose to use a hierarchical representation that includes the visual screen coordinates of every HTML object in the page. The use of the visual layout allows us to identify common page components such as the header, the navigation bars, the left and right menus, the footer, and the informative parts of the page. The functional role of each object is recognized by a set of heuristic rules. The experimental results show that page areas are correctly classified in 73% of the cases. Identifying the different functional areas of the page allows the definition of a more accurate method for representing the text contents of the page, which splits the text features into different subsets according to the area they belong to. We show that this approach can improve the classification accuracy for page topic categorization by more than 10% with respect to a flat “bag-of-words” representation.
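The idea of splitting text features by page area can be illustrated with a minimal Python sketch. It is not the paper's rule set: the `Block` type, the `label_area` and `area_features` functions, the area names, and every threshold below are hypothetical and chosen only for illustration. Each block is assumed to already carry the screen coordinates of a rendered HTML object.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Block:
    """A rendered HTML object with its text and screen coordinates (pixels)."""
    text: str
    x: float
    y: float
    width: float
    height: float


def label_area(block: Block, page_width: float, page_height: float) -> str:
    """Assign a functional area from the block's center point.

    The thresholds are illustrative assumptions, not the paper's heuristics.
    """
    cx = block.x + block.width / 2
    cy = block.y + block.height / 2
    if cy < 0.15 * page_height:
        return "header"
    if cy > 0.90 * page_height:
        return "footer"
    if cx < 0.20 * page_width:
        return "left_menu"
    if cx > 0.80 * page_width:
        return "right_menu"
    return "center"


def area_features(blocks, page_width: float, page_height: float) -> Counter:
    """Count words per area, so the same word in two areas is two features."""
    features = Counter()
    for block in blocks:
        area = label_area(block, page_width, page_height)
        for word in block.text.lower().split():
            features[f"{area}:{word}"] += 1
    return features


if __name__ == "__main__":
    page = [
        Block("Acme News", 0, 0, 1000, 100),
        Block("Home Sports", 0, 400, 150, 600),
        Block("Sports results today", 250, 400, 500, 800),
        Block("Copyright Acme", 0, 1900, 1000, 100),
    ]
    for name, count in sorted(area_features(page, 1000, 2000).items()):
        print(name, count)
```

In this sketch the word "sports" in the left menu and in the central content becomes two separate features, whereas a flat bag-of-words representation would merge them into one count.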

M. Gori | Marco Maggini | Michelangelo Diligenti | Miloš Kovačević | Veljko Milutinović
