Semi-Automatic Wrapper Generation for Commercial Web Sources

Semi-automatic wrapper generation tools aim to ease the task of building structured views over semi-structured web sources. However, the wrapper generation techniques presented to date cannot properly handle sources that require complex navigational sequences to access their data. In this paper, we present WARGO, a semi-automatic wrapper generation tool that non-programmer staff have used to successfully wrap more than 700 commercial web sources in several industrial applications. We describe our approach to wrapper generation and show the difficulties that other systems encounter when wrapping this kind of source.
