Text-Based Web Page Classification with Use of Visual Information

As the number of pages on the web is continually increasing, there is a need to classify pages into categories to facilitate indexing and searching. In the method proposed here, we use both textual and visual information to find a suitable representation of web page content. We propose several term weights that modify TF or TF-IDF weighting. The modifications are based on the visual areas in which the text appears and on the visual properties of those areas. Experimental results are presented in the final part of the paper.
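The idea of modifying TF or TF-IDF by visual area can be illustrated with a minimal sketch. The abstract does not give the actual weighting formulas, so everything here is an assumption made for illustration: each page is taken to be already segmented into text blocks, each block carries a hypothetical numeric `area_weight` derived from its visual properties, and a term's frequency is the sum of the weights of the blocks in which it occurs. The function names `visual_tf` and `visual_tf_idf` are likewise hypothetical.

```python
import math
from collections import Counter


def visual_tf(blocks):
    """Visually weighted term frequency for one page.

    blocks: list of (text, area_weight) pairs. The area_weight is a
    hypothetical score for the visual area holding the text; it is an
    assumption of this sketch, not a value defined in the paper.
    """
    tf = Counter()
    for text, area_weight in blocks:
        for term in text.lower().split():
            # Each occurrence counts as area_weight instead of 1.
            tf[term] += area_weight
    return tf


def visual_tf_idf(pages):
    """Multiply the weighted TF by a standard log(N / df) IDF factor."""
    tfs = [visual_tf(blocks) for blocks in pages]
    n_pages = len(tfs)
    df = Counter()
    for tf in tfs:
        df.update(tf.keys())  # document frequency: one count per page
    return [
        {term: weight * math.log(n_pages / df[term]) for term, weight in tf.items()}
        for tf in tfs
    ]


# Two toy pages; the first has a prominent block (weight 2.0) and a
# plain block (weight 1.0).
pages = [
    [("news sport", 2.0), ("sport", 1.0)],
    [("news", 1.0)],
]
vectors = visual_tf_idf(pages)
```

With all area weights set to 1.0, this sketch reduces to plain TF-IDF, which makes it easy to compare a visually weighted representation against the unweighted baseline.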
