Automatic Web Content Extraction by Combination of Learning and Grouping

Web pages consist of not only actual content, but also other elements such as branding banners, navigational elements, advertisements, and copyright notices. This noisy content is typically unrelated to the main subject of the page. Identifying the actual content, also known as clipping web pages, has many applications, such as high-quality web printing, e-reading on mobile devices, and data mining. Although many existing methods address this task, most of them either work only on certain types of web pages, e.g. article pages, or require a different model for each website. We formulate actual content identification as a DOM tree node selection problem. We develop multiple features from DOM tree node properties to train a machine learning model, and then select candidate nodes based on the learned model. Based on the observation that the actual content is usually located in a spatially continuous block, we develop a grouping technique that further filters out noisy nodes and recovers missing nodes among the candidates. We conduct extensive experiments on a real dataset and demonstrate that our solution produces high-quality output and outperforms several baseline methods.
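
The select-then-group idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `group_candidates`, the `threshold` and `max_gap` parameters, and the use of document order as a one-dimensional stand-in for spatial continuity are all assumptions made for illustration. The scores stand in for the output of some per-node classifier.

```python
from typing import List


def group_candidates(scores: List[float],
                     threshold: float = 0.5,
                     max_gap: int = 1) -> List[int]:
    """Return indices of the nodes kept after grouping.

    scores: hypothetical classifier scores for DOM nodes, listed in
            document order (an assumed proxy for spatial position).
    threshold: assumed cut-off for selecting a node as a candidate.
    max_gap: assumed number of unselected nodes a block may bridge.
    """
    # Step 1: candidate selection based on the learned model's score.
    picked = [i for i, s in enumerate(scores) if s >= threshold]
    if not picked:
        return []

    # Step 2: split candidates into runs, bridging small gaps.
    runs = [[picked[0]]]
    for i in picked[1:]:
        if i - runs[-1][-1] - 1 <= max_gap:
            runs[-1].append(i)
        else:
            runs.append([i])

    # Step 3: keep the run with the most candidates; isolated
    # candidates outside it are treated as noise and dropped.
    best = max(runs, key=len)

    # Step 4: recover the bridged nodes so the block is continuous.
    return list(range(best[0], best[-1] + 1))


scores = [0.9, 0.1, 0.1, 0.8, 0.7, 0.2, 0.9, 0.6, 0.1, 0.1, 0.7]
print(group_candidates(scores))  # [3, 4, 5, 6, 7]
```

In this toy input, nodes 0 and 10 score highly but are isolated, so they are filtered out as noise, while node 5 scores low but sits inside the main block, so it is recovered. A real system would group on rendered layout coordinates rather than list position.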
