A methodical approach to extracting interesting objects from dynamic web pages

This paper presents a fully automated object extraction system for web documents. Our methodology consists of a layered framework and a set of algorithms. A distinctive feature of our approach is the full automation of both the extraction of data object regions from dynamic web pages and the identification of the correct object-boundary separators. We implemented the methodology in the XWRAPElite object extraction system and evaluated the system on more than 3200 pages from 75 diverse websites. Our experiments show three important results. First, our algorithms for identifying the minimal object-rich subtree achieve a 96% success rate over all the web pages we tested. Second, our algorithms for discovering and extracting object separator tags reach a success rate of 95%. Most significantly, the overall system achieves a precision between 96% and 100% (it returns only correct objects) and excellent recall (between 95% and 96%, with very few significant objects left out). The minimal subtree extraction algorithms and the object-boundary identification algorithms are fast, taking about 87 milliseconds per page at an average page size of 30 KB.
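To make the two stages named above more concrete, the following is a minimal sketch of the general idea, not the XWRAPElite algorithms themselves, which the abstract does not specify. It assumes two deliberately simple heuristics: the object-rich subtree is taken to be the node with the largest number of children (fanout), and the separator is taken to be the most frequent child tag of that node. All names (`Node`, `TreeBuilder`, `object_rich_subtree`, `separator_tag`, `extract_objects`) are hypothetical, and the parser assumes well-formed HTML in which every non-void tag is closed.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags that never have a closing tag; assumed sufficient for this sketch.
VOID = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    """A minimal tag-tree node: tag name, children, and contained text size."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text_len = 0  # characters of text anywhere under this node


class TreeBuilder(HTMLParser):
    """Builds a tag tree from well-formed HTML (no error recovery)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID:
            self.current = node  # descend only into tags that will be closed

    def handle_endtag(self, tag):
        if tag not in VOID and self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        size = len(data.strip())
        node = self.current
        while node is not None:  # credit the text to every ancestor
            node.text_len += size
            node = node.parent


def walk(node):
    """Yield the node and all of its descendants in document order."""
    yield node
    for child in node.children:
        yield from walk(child)


def object_rich_subtree(root):
    """Assumed heuristic: largest fanout, ties broken by contained text size."""
    return max(walk(root), key=lambda n: (len(n.children), n.text_len))


def separator_tag(subtree):
    """Assumed heuristic: the most frequent child tag marks object boundaries."""
    counts = Counter(child.tag for child in subtree.children)
    return counts.most_common(1)[0][0] if counts else None


def extract_objects(html):
    """Return (subtree tag, separator tag, list of object nodes)."""
    builder = TreeBuilder()
    builder.feed(html)
    subtree = object_rich_subtree(builder.root)
    sep = separator_tag(subtree)
    return subtree.tag, sep, [c for c in subtree.children if c.tag == sep]
```

On a page whose results sit in a table with one row per item, this sketch would select the table as the subtree and the row tag as the separator. Real pages are far noisier, which is presumably why the paper combines several algorithms rather than relying on one heuristic.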
