Declarative Information Extraction, Web Crawling, and Recursive Wrapping with Lixto

Lixto is a system and method for the visual and interactive generation of wrappers for Web pages under the supervision of a human developer, for automatically extracting information from Web pages using such wrappers, and for translating the extracted content into XML. This paper describes some advanced features of Lixto, such as disjunctive pattern definitions, specialization rules, and Lixto's capability of collecting and aggregating information from several linked Web pages.
