A Quantitative Comparison of Semantic Web Page Segmentation Algorithms

This thesis explores the effectiveness of different semantic Web page segmentation algorithms on modern websites. We compare four algorithms: BlockFusion, PageSegmenter, VIPS, and the novel WebTerrain algorithm, which was developed as part of this thesis. We introduce a new testing framework that makes it possible to selectively run different algorithms on different datasets and that then automatically compares the generated results to the ground truth. We used it to run each of the four algorithms in eight different configurations, varying the dataset, the evaluation metric, and the type of input HTML document, for a total of 32 combinations. We found that, on average, all algorithms performed better on random pages than on popular pages, most likely because popular pages are more complex. Furthermore, the results are better when the algorithms are run on the HTML obtained from the DOM than on the plain HTML. Of the algorithms compared, BlockFusion has the lowest average F-score and WebTerrain the highest. Overall, there is still room for improvement: the best average F-score we find is 0.49.

So walk on, defenseless, through life, and fear nothing! (Friedrich Hölderlin)
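The arithmetic behind the 32 combinations and the F-score can be sketched as follows. This is a minimal illustration, not the thesis framework: it assumes each of the three varied dimensions takes two values (2 × 2 × 2 = 8 configurations), that "F-score" means the balanced F1 measure, and the metric and input-type labels are hypothetical placeholders.

```python
from itertools import product

# The four algorithms named in the abstract.
ALGORITHMS = ["BlockFusion", "PageSegmenter", "VIPS", "WebTerrain"]

# Assumption: two values per varied dimension, giving eight configurations.
DATASETS = ["random", "popular"]
METRICS = ["metric_1", "metric_2"]        # hypothetical placeholder names
INPUT_TYPES = ["dom_html", "plain_html"]  # hypothetical labels


def f_score(true_positives: int, returned: int, expected: int) -> float:
    """Balanced F-score (F1) from block counts.

    true_positives: returned blocks that match a ground-truth block
    returned:       blocks produced by the algorithm
    expected:       blocks in the ground truth
    """
    if returned == 0 or expected == 0:
        return 0.0
    precision = true_positives / returned
    recall = true_positives / expected
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Every (algorithm, dataset, metric, input type) combination.
runs = list(product(ALGORITHMS, DATASETS, METRICS, INPUT_TYPES))

if __name__ == "__main__":
    print(len(runs))         # number of combinations
    print(f_score(1, 2, 2))  # precision 0.5, recall 0.5
```

With these assumptions, an average F-score of 0.49 corresponds roughly to an algorithm whose precision and recall are both near one half.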
