Exploiting Multi-Category Characteristics and Unified Framework to Extract Web Content

Abstract Web content extraction obtains the required data embedded in web pages, usually structured records, such as product information, and text content, such as news. Web pages use a large number of HTML tags to organize and present information. Because page structures are largely unknown in advance and different kinds of information are mixed within a page, it is difficult to guarantee both extraction performance and extraction adaptability. This study proposes a unified web content extraction framework that can be applied in various web environments to extract both structured records and text content. First, we construct a characteristic container that holds the characteristics relevant to the extraction objectives, including visual text information, content semantics (instead of HTML tag semantics), and web page structures. Second, these characteristics are integrated into an extraction framework that makes extraction decisions on different web sites. In particular, we put forward two strategies for locating the extraction area by exploiting these characteristics: path aggregation for text content and a hidden Markov model (HMM) for structured records. Comparative experiments on multiple web sites against popular extraction methods, including CETR, CETD and CNBE, show that the proposed method provides better extraction precision and extraction adaptability.
