Paper information - An automatic wrapper generation process for large scale crawling of news websites


The creation and maintenance of a large-scale news content aggregator is a tedious task, which requires more than a simple RSS aggregator. Many news sites appear every day on the Internet, providing new content at different refresh rates; well-established news sites restrict access to their content to subscribers or online readers, without offering RSS feeds, whereas other sites update their CMS or website template and cause crawlers to hit fetch errors. The main problem that arises from this continuous generation and alteration of pages on the Internet is the automated discovery of the appropriate and useful content and of the dynamic rules that crawlers need to apply in order not to become outdated. In this paper we present an innovative mechanism for extracting useful content (title, body and media) from news article web pages, based on the automatic extraction of the patterns that form each domain. The system is able to achieve high performance by combining information gathered while discovering the structure of a news site with "knowledge" that it acquires at each crawling step, in order to improve the quality of the next steps of its own procedure. Additionally, the system can recognize changes in patterns in order to rebuild the domain rules whenever the domain changes structure. This system has been successfully implemented in palo.rs, the first news search engine in Serbia.

Nikos Tsirakis | Iraklis Varlamis | Panagiotis Tsantilas | Vasilis Poulopoulos

[1] Kareem Darwish, et al. Automatic Extraction of Textual Elements from News Web Pages, 2008, LREC.

[2] Xiaowei Wang, et al. News Information Extraction Based on Adaptive Weighting Using Unsupervised Bayesian Algorithm, 2011, WISM.

[3] Wei-Ying Ma, et al. VIPS: a Vision-based Page Segmentation Algorithm, 2003.

[4] Ji-Rong Wen, et al. Template-Independent News Extraction Based on Visual Consistency, 2007, AAAI.

[5] Frederick H. Lochovsky, et al. Data extraction and label assignment for web databases, 2003, WWW '03.

[6] Hongjun Lu, et al. Toward Learning Based Web Query Processing, 2000, VLDB.

[7] Xiaoli Li, et al. Eliminating noisy information in Web pages for data mining, 2003, KDD '03.

[8] Gabriel Zaccak, et al. Wrapster: semi-automatic wrapper generation for semi-structured websites, 2007.

[9] Hao Yu, et al. Automatic Wrapper Generation and Maintenance, 2011, PACLIC.

[10] Craig A. Knoblock, et al. A hierarchical approach to wrapper induction, 1999, AGENTS '99.

[11] Nicholas Kushmerick, et al. Wrapper Induction for Information Extraction, 1997, IJCAI.

[12] Jianwu Yang, et al. A very efficient approach to news title and content extraction on the web, 2011, JCDL '11.

[13] Chun-Nan Hsu, et al. Generating Finite-State Transducers for Semi-Structured Data Extraction from the Web, 1998, Inf. Syst.