Towards an Improved Vision-Based Web Page Segmentation Algorithm

In this paper we introduce an edge-based segmentation algorithm designed for web pages. We treat each web page as an image and perform segmentation as the first stage of a planned parsing system that will also include region classification. Our motivation is to improve online experiences for users with assistive needs: the system is intended to serve as the back-end process for front-end tasks such as zooming and decluttering the image presented to users with visual or cognitive challenges, or producing less unwieldy output from screen readers. Our focus is therefore on interpreting a class of man-made images, of which web pages are one particular subset with important constraints that assist processing. After clarifying how the algorithm compares with an earlier model of ours, we present validation of our method. We then briefly discuss its contribution to the field of computer vision, contrasting it with current segmentation work that focuses on natural images.
