Web data extraction based on partial tree alignment

This paper studies the problem of extracting data from a Web page that contains several structured data records. The objective is to segment these data records, extract data items/fields from them and put the data in a database table. This problem has been studied by several researchers. However, existing methods still have some serious limitations. The first class of methods is based on machine learning, which requires human labeling of many examples from each Web site that one is interested in extracting data from. The process is time consuming due to the large number of sites and pages on the Web. The second class of algorithms is based on automatic pattern discovery. These methods are either inaccurate or make many assumptions. This paper proposes a new method to perform the task automatically. It consists of two steps, (1) identifying individual data records in a page, and (2) aligning and extracting data items from the identified data records. For step 1, we propose a method based on visual information to segment data records, which is more accurate than existing methods. For step 2, we propose a novel partial alignment technique based on tree matching. Partial alignment means that we align only those data fields in a pair of data records that can be aligned (or matched) with certainty, and make no commitment on the rest of the data fields. This approach enables very accurate alignment of multiple data records. Experimental results using a large number of Web pages from diverse domains show that the proposed two-step technique is able to segment data records, align and extract data from them very accurately.

[1]  Craig A. Knoblock,et al.  A hierarchical approach to wrapper induction , 1999, AGENTS '99.

[2]  Alberto H. F. Laender,et al.  Automatic web news extraction using tree edit distance , 2004, WWW '04.

[3]  Kaizhong Zhang,et al.  On the Editing Distance Between Unordered Labeled Trees , 1992, Inf. Process. Lett..

[4]  Robert L. Grossman,et al.  Mining data records in Web pages , 2003, KDD '03.

[5]  Hsin-Hsi Chen,et al.  Mining Tables from Large Scale HTML Texts , 2000, COLING.

[6]  Yonatan Aumann,et al.  Structural extraction from visual layout of documents , 2002, CIKM '02.

[7]  Wuu Yang,et al.  Identifying syntactic differences between two programs , 1991, Softw. Pract. Exp..

[8]  Calton Pu,et al.  A fully automated object extraction system for the World Wide Web , 2001, Proceedings 21st International Conference on Distributed Computing Systems.

[9]  P. Hogeweg,et al.  The alignment of sets of sequences and the construction of phyletic trees: An integrated method , 2005, Journal of Molecular Evolution.

[10]  Weimin Chen,et al.  New Algorithm for Ordered Tree-to-Tree Correction Problem , 2001, J. Algorithms.

[11]  Dan Gusfield,et al.  Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology , 1997 .

[12]  Yalin Wang,et al.  A machine learning based approach for table detection on the web , 2002, WWW '02.

[13]  D. Lipman,et al.  The multiple sequence alignment problem in biology , 1988 .

[14]  Chun-Nan Hsu,et al.  Generating Finite-State Transducers for Semi-Structured Data Extraction from the Web , 1998, Inf. Syst..

[15]  Kaizhong Zhang,et al.  An Algorithm for Finding the Largest Approximately Common Substructures of Two Trees , 1998, IEEE Trans. Pattern Anal. Mach. Intell..

[16]  Ziv Bar-Yossef,et al.  Template detection via data mining and its applications , 2002, WWW.

[17]  Chia-Hui Chang,et al.  IEPAD: information extraction based on pattern discovery , 2001, WWW '01.

[18]  Gabriel Alejandro,et al.  Tree edit distance and common subtrees , 2002 .

[19]  William W. Cohen,et al.  A flexible learning system for wrapping tables and lists in HTML documents , 2002, WWW.

[20]  W. Bruce Croft,et al.  Table extraction using conditional random fields , 2003, DG.O.

[21]  Lakshmish Ramaswamy,et al.  Automatic detection of fragments in dynamically generated web pages , 2004, WWW '04.

[22]  Huang Yu Extracting Semi-Structured Information from the WEB , 2000 .

[23]  Ricardo A. Baeza-Yates,et al.  Algorithms for string searching , 1989, SIGF.

[24]  Nicholas Kushmerick,et al.  Wrapper induction: Efficiency and expressiveness , 2000, Artif. Intell..

[25]  M. Sternberg,et al.  A strategy for the rapid multiple alignment of protein sequences. Confidence levels from tertiary structure comparisons. , 1987, Journal of molecular biology.

[26]  Hector Garcia-Molina,et al.  Extracting structured data from Web pages , 2003, SIGMOD '03.

[27]  Wei-Ying Ma,et al.  Learning block importance models for web pages , 2004, WWW '04.

[28]  David W. Embley,et al.  Record-boundary discovery in Web documents , 1999, SIGMOD '99.

[29]  Kristina Lerman,et al.  Using the structure of Web sites for automatic segmentation of tables , 2004, SIGMOD '04.

[30]  Chaomei Chen,et al.  Mining the Web: Discovering knowledge from hypertext data , 2004, J. Assoc. Inf. Sci. Technol..

[31]  C. Notredame,et al.  Recent progress in multiple sequence alignment: a survey. , 2002, Pharmacogenomics.

[32]  Frederick H. Lochovsky,et al.  Data extraction and label assignment for web databases , 2003, WWW '03.

[33]  Valter Crescenzi,et al.  RoadRunner: Towards Automatic Data Extraction from Large Web Sites , 2001, VLDB.

[34]  Kuo-Chung Tai,et al.  The Tree-to-Tree Correction Problem , 1979, JACM.

[35]  Oren Etzioni,et al.  A scalable comparison-shopping agent for the World-Wide Web , 1997, AGENTS '97.