Hybrid Method for Automated News Content Extraction from the Web

Web news content extraction is vital for improving news indexing and search in today's search engines, especially for news search services. In this paper, we study the Web news content extraction problem and propose an automated extraction algorithm for it. Our method is a hybrid that takes advantage of both sequence matching and tree matching techniques. We propose TSReC, a variant of tag sequence representation suitable for both sequence matching and tree matching, along with an associated algorithm for automated Web news content extraction. We implement a prototype system for Web news content extraction and evaluate it empirically; the results show that our method is highly effective and efficient.
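
To make the idea of a tag sequence that serves both sequence matching and tree matching more concrete, the following is a minimal illustrative sketch in Python. It is not TSReC, whose definition is not given in this abstract. It assumes one plausible reading: each page is flattened into a sequence of items that also record their nesting depth (so tree structure remains recoverable), and two pages from the same site are aligned so that text differing between them is treated as candidate news content. The names `TagSequenceBuilder`, `tag_sequence`, and `differing_text` are hypothetical, and the alignment uses the standard library's `difflib.SequenceMatcher` rather than the paper's algorithm.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TagSequenceBuilder(HTMLParser):
    """Flatten HTML into (depth, kind, value) items.

    The flat list supports sequence matching; the depth field keeps
    enough information to recover the tree nesting.
    """

    def __init__(self):
        super().__init__()
        self.items = []
        self.depth = 0

    def handle_starttag(self, tag, attrs):
        self.items.append((self.depth, "tag", "<" + tag + ">"))
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth = max(0, self.depth - 1)
            self.items.append((self.depth, "tag", "</" + tag + ">"))

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.items.append((self.depth, "text", text))


def tag_sequence(html):
    """Return the depth-annotated tag sequence of an HTML string."""
    builder = TagSequenceBuilder()
    builder.feed(html)
    builder.close()
    return builder.items


def differing_text(page_a, page_b):
    """Return text items of page_b that do not align with page_a.

    Assumption for this sketch: shared template parts align, while
    page-specific content shows up as replaced or inserted items.
    """
    seq_a, seq_b = tag_sequence(page_a), tag_sequence(page_b)
    matcher = SequenceMatcher(a=seq_a, b=seq_b, autojunk=False)
    result = []
    for op, _, _, j1, j2 in matcher.get_opcodes():
        if op in ("replace", "insert"):
            result.extend(v for _, kind, v in seq_b[j1:j2] if kind == "text")
    return result
```

For example, given two pages that share a `<div>Site Menu</div>` block but carry different story paragraphs, `differing_text` returns only the story text of the second page, since the shared template items align and are skipped.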
