Paper Information - Web Data Extraction Based on Tree Structure Analysis and Template Generation

This paper studies the problem of extracting data from large numbers of semi-structured web pages. Many websites serve enormous numbers of pages that are generated dynamically from an underlying structured source such as a database, which makes it feasible to induce a common template for similar web pages and then extract data accordingly. Previous work on this problem has limited practical utility because it either requires significant human effort or relies on several brittle assumptions. We propose a three-step approach, consisting of template generation, template detection, and data extraction, with a small amount of human intervention in template editing. The core algorithm is based on two highly efficient tree structure analysis techniques. Experimental results show that our approach can extract web data with high accuracy and flexibility.
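The abstract does not spell out the two tree analysis techniques, so the following is only a minimal sketch of the general idea, under simplifying assumptions: pages are already parsed into nested `(tag, children)` tuples, two pages from the same template have identical tree shapes, and differing text leaves are data fields. The names `induce_template`, `extract`, and `SLOT` are hypothetical and are not taken from the paper.

```python
from typing import List, Optional, Tuple, Union

# A node is either a text leaf (str) or an element: (tag, [children]).
Node = Union[str, Tuple[str, list]]
SLOT = "*"  # hypothetical marker for a variable data field


def induce_template(a: Node, b: Node) -> Optional[Node]:
    """Merge two trees top-down; differing text leaves become SLOT."""
    if isinstance(a, str) and isinstance(b, str):
        return a if a == b else SLOT
    if isinstance(a, str) or isinstance(b, str):
        return None  # leaf vs. element: the structures disagree
    (tag_a, kids_a), (tag_b, kids_b) = a, b
    if tag_a != tag_b or len(kids_a) != len(kids_b):
        return None
    merged = []
    for x, y in zip(kids_a, kids_b):
        m = induce_template(x, y)
        if m is None:
            return None
        merged.append(m)
    return (tag_a, merged)


def extract(template: Node, page: Node) -> Optional[List[str]]:
    """Return the slot values if the page matches the template, else None."""
    if isinstance(template, str):
        if not isinstance(page, str):
            return None
        if template == SLOT:
            return [page]
        return [] if template == page else None
    if isinstance(page, str):
        return None
    (t_tag, t_kids), (p_tag, p_kids) = template, page
    if t_tag != p_tag or len(t_kids) != len(p_kids):
        return None
    values: List[str] = []
    for t, p in zip(t_kids, p_kids):
        got = extract(t, p)
        if got is None:
            return None
        values.extend(got)
    return values


# Toy pages standing in for parsed DOM trees (illustrative data only).
page_a = ("html", [("b", ["Title:"]), ("p", ["Dune"]),
                   ("b", ["Price:"]), ("p", ["$9"])])
page_b = ("html", [("b", ["Title:"]), ("p", ["Emma"]),
                   ("b", ["Price:"]), ("p", ["$7"])])
page_c = ("html", [("b", ["Title:"]), ("p", ["Ulysses"]),
                   ("b", ["Price:"]), ("p", ["$12"])])

template = induce_template(page_a, page_b)
print(extract(template, page_c))
```

In this sketch, `induce_template` plays the role of template generation, and `extract` covers both template detection (it returns `None` when the page does not fit) and data extraction. A practical system would also need to handle optional and repeated subtrees, which this strict shape comparison does not.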

Jing Li | Guoshi Wu | Haikun Hong | Xiaoxin Chen

[1] Charles Schafer, et al. Bootstrapping Information Extraction from Semi-structured Web Pages, 2008, ECML/PKDD.

[2] Bing Liu, et al. Web data extraction based on partial tree alignment, 2005, WWW '05.

[3] Valter Crescenzi, et al. Automatic annotation of data extracted from large Web sites, 2003, WebDB.

[4] Wuu Yang, et al. Identifying syntactic differences between two programs, 1991, Softw. Pract. Exp.

[5] Alberto H. F. Laender, et al. Automatic web news extraction using tree edit distance, 2004, WWW '04.

[6] Valter Crescenzi, et al. RoadRunner: Towards Automatic Data Extraction from Large Web Sites, 2001, VLDB.

[7] Jane Yung-jen Hsu, et al. Tree-Structured Template Generation for Web Pages, 2004, IEEE/WIC/ACM International Conference on Web Intelligence (WI'04).

[8] Ruihua Song, et al. Joint optimization of wrapper generation and template detection, 2007, KDD '07.

[9] Valter Crescenzi, et al. Wrapping-oriented classification of web pages, 2002, SAC '02.

[10] Chia-Hui Chang, et al. IEPAD: information extraction based on pattern discovery, 2001, WWW '01.

[11] Hector Garcia-Molina, et al. Extracting structured data from Web pages, 2003, SIGMOD '03.