Extracting Web Content by Exploiting Multi-Category Characteristics

Web content extraction aims to separate the main content of a web page from everything else on the page, since that content is organized and presented through different HTML templates and is surrounded by various kinds of unrelated information. Little is known about template structures or noise before extraction, and page templates vary widely, which makes it difficult to guarantee both extraction precision and extraction adaptability. This study proposes a web content extraction method for varied web environments. To ensure extraction performance, we exploit three kinds of characteristics: visual text information, content semantics (instead of HTML tag semantics), and web page structures. These characteristics are integrated into an extraction framework that makes extraction decisions for different websites. Comparative experiments on multiple websites against two popular extraction methods, CETR and CETD, show that the proposed method outperforms CETR on precision while retaining the same advantage on recall, and gains a 4% improvement over CETD in average F1-score. In particular, the method extracts short content better than CETD and shows better extraction adaptability.
