Web Page Segmentation Towards Information Extraction for Web Semantics

Today, the web is a large source of information, both structured and unstructured. Efficient information extraction from the many unstructured sources on the web is therefore needed, and it plays a prominent role in the current scenario. Information extraction focuses on automatically extracting structured information from unstructured, distributed resources on the web, and it is based on several approaches. Web page segmentation is one of the most significant of these techniques: a web page is broken down into semantically related parts. There are various approaches to web page segmentation. In this paper, first, information extraction is explored, discussed, and reviewed. Second, web page segmentation and its various approaches are revisited, and a comparative analysis is made. Third, the phases of vision-based web page segmentation are presented and reviewed along with a flowchart. Finally, the results and conclusions are presented along with future work.
