Structured Data Extraction from the Web Based on Partial Tree Alignment

This paper studies the problem of structured data extraction from arbitrary Web pages. The objective is to automatically segment the data records in a page, extract data items/fields from these records, and store the extracted data in a database. Existing methods addressing the problem can be classified into three categories. Methods in the first category provide specialized languages to facilitate the construction of data extraction systems. Methods in the second category use machine learning techniques to learn wrappers (data extraction programs) from human-labeled examples; manual labeling is time-consuming and hard to scale to a large number of sites on the Web. Methods in the third category are based on the idea of automatic pattern discovery, but they usually need multiple pages that conform to a common schema as input. In this paper, we propose a novel and effective technique (called DEPTA) to perform the task of Web data extraction automatically. The method consists of two steps: 1) identifying individual records in a page and 2) aligning and extracting data items from the identified records. For step 1, a method based on visual information and tree matching is used to segment data records. For step 2, a novel partial alignment technique is proposed. This method aligns only those data items in a pair of records that can be aligned with certainty, making no commitment on the rest of the items. Experimental results obtained using a large number of Web pages from diverse domains show that the proposed two-step technique is highly effective.
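The two steps can be illustrated with a minimal Python sketch. It is not the DEPTA implementation, and it rests on several assumptions. The names `Node`, `simple_tree_matching`, `lcs_pairs`, and `partial_align` are hypothetical. Simple Tree Matching is used as one well-known tree matching algorithm, standing in for the matching in step 1 (the visual information is omitted). The partial alignment of step 2 is shown on flat sequences of item labels rather than on full trees, and an unmatched item is treated as alignable "with certainty" only when its matched neighbours are adjacent in the seed, so that its position is unique.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """A tag-tree node (hypothetical structure for this sketch)."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Number of node pairs in a maximum order-preserving top-down matching."""
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    best = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = simple_tree_matching(a.children[i - 1], b.children[j - 1])
            best[i][j] = max(best[i - 1][j], best[i][j - 1],
                             best[i - 1][j - 1] + w)
    return best[m][n] + 1  # +1 for the matched roots


def lcs_pairs(x: List[str], y: List[str]) -> List[Tuple[int, int]]:
    """Index pairs (i, j) of one longest common subsequence of x and y."""
    m, n = len(x), len(y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if x[i] == y[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    pairs, i, j = [], 0, 0
    while i < m and j < n:
        if x[i] == y[j]:
            pairs.append((i, j))
            i, j = i + 1, j + 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def partial_align(seed: List[str], record: List[str]):
    """Merge record items into seed only where their position is unique.

    Returns (merged_seed, unplaced_items). Assumption of this sketch: a run
    of unmatched record items is placed only if the matched items around it
    are adjacent in the seed; otherwise no commitment is made.
    """
    pairs = lcs_pairs(seed, record)
    if not pairs:
        return list(seed), list(record)
    # Sentinels let leading and trailing runs use the same adjacency test.
    bounds = [(-1, -1)] + pairs + [(len(seed), len(record))]
    inserts, unplaced = {}, []
    for (s0, r0), (s1, r1) in zip(bounds, bounds[1:]):
        run = record[r0 + 1:r1]
        if not run:
            continue
        if s1 == s0 + 1:
            inserts[s1] = run      # unique position: just before seed[s1]
        else:
            unplaced.extend(run)   # ambiguous position: leave unaligned
    merged = []
    for pos in range(len(seed) + 1):
        merged.extend(inserts.get(pos, []))
        if pos < len(seed):
            merged.append(seed[pos])
    return merged, unplaced
```

For example, under these assumptions `partial_align(["a", "b", "e"], ["b", "c", "d", "e"])` places `c` and `d` between `b` and `e`, whereas `partial_align(["a", "b", "e"], ["a", "x", "e"])` leaves `x` unplaced, because `x` could sit either before or after `b`.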
