HTML Segmentation for Different Types of Web Pages

Search engines face several kinds of challenges daily. One of them is locating relevant content in a Web page. However, what counts as relevant in information retrieval depends on the problem being solved. For instance, a website's menu does not affect the results of an algorithm that detects duplicate Web pages. An HTML segmentation algorithm partitions a Web page visually so that the parts within the same partition are semantically related. This chapter presents two strategies for segmenting different types of Web pages.
