Box clustering segmentation: A new method for vision-based web page preprocessing

A new, purely vision-based segmentation technique is formally described. Only a few simple visual cues are used to assess the similarity of the rectangles. Its performance is better by an order of magnitude when compared with the competition. Rectangle clustering is a viable way to perform web page segmentation.

This paper presents a novel approach to web page segmentation, which is one of the substantial preprocessing steps when mining data from web documents. Most current segmentation methods are based on algorithms that work on a tree representation of web pages (a DOM tree or a hierarchical rendering model) and produce another tree structure as output. In contrast, our method uses a rendering engine to get an image of the web page, takes the smallest rendered elements of that image, clusters them using a custom algorithm, and produces a flat set of segments of a given granularity. For the clustering metrics, we use purely visual properties only: the distance between elements and their visual similarity. We experimentally evaluate the properties of our algorithm by processing 2400 web pages. On this set of web pages, we show that our algorithm is almost 90% faster than the reference algorithm. We also show that our algorithm's accuracy is between 47% and 133% of the reference algorithm's accuracy, with an indirect correlation between our algorithm's accuracy and the depth of the inspected page structure. In our experiments, we also demonstrate the advantages of producing a flat segmentation structure instead of a hierarchy.
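To make the idea of clustering rendered rectangles by distance and visual similarity more concrete, the following is a minimal sketch and not the algorithm described in the paper. It assumes each rendered element is reduced to a bounding box with a single RGB color, measures distance as the gap between box edges, measures visual similarity as a normalized RGB difference, and merges boxes with a simple union-find pass. The names `Box`, `gap`, `color_difference`, `cluster`, and the thresholds `max_gap` and `max_color_diff` are hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Box:
    """A rendered element: top-left corner, size, and one RGB color (assumed)."""
    x: float
    y: float
    w: float
    h: float
    color: tuple  # (r, g, b), each 0-255


def gap(a, b):
    """Shortest distance between the edges of two boxes (0 if they overlap)."""
    dx = max(a.x - (b.x + b.w), b.x - (a.x + a.w), 0.0)
    dy = max(a.y - (b.y + b.h), b.y - (a.y + a.h), 0.0)
    return hypot(dx, dy)


def color_difference(a, b):
    """Euclidean RGB distance scaled to the range [0, 1]."""
    d = sum((p - q) ** 2 for p, q in zip(a.color, b.color)) ** 0.5
    return d / (3 * 255 ** 2) ** 0.5


def cluster(boxes, max_gap=20.0, max_color_diff=0.25):
    """Group boxes that are both close and visually similar into flat segments.

    The thresholds are illustrative; a larger max_gap yields coarser segments.
    """
    parent = list(range(len(boxes)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            close = gap(boxes[i], boxes[j]) <= max_gap
            similar = color_difference(boxes[i], boxes[j]) <= max_color_diff
            if close and similar:
                parent[find(i)] = find(j)

    # Collect boxes by root to produce a flat (non-hierarchical) result.
    segments = {}
    for i, box in enumerate(boxes):
        segments.setdefault(find(i), []).append(box)
    return list(segments.values())
```

The output is a plain list of segments rather than a tree, which mirrors the flat structure described in the abstract. The pairwise loop is quadratic in the number of boxes and is meant only to show the two visual cues working together.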
