Building intelligent Web applications using lightweight wrappers

Abstract: The Web has so far been remarkably successful at delivering information to human users. So successful, in fact, that there is now an urgent need to go beyond the browsing human. Unfortunately, the Web is not yet a well-organized repository of nicely structured documents but rather a conglomerate of volatile HTML pages. To address this problem, we present the World Wide Web Wrapper Factory (W4F), a toolkit for generating wrappers for Web sources that offers: (1) an expressive language to specify the extraction of complex structures from HTML pages; (2) a declarative mapping to various data formats such as XML; (3) visual tools that make the engineering of wrappers faster and easier.
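To make the idea of a wrapper concrete, the two stages named above (extraction from HTML, then mapping to XML) can be sketched in a few lines. This is a minimal illustration in Python using only the standard library. It is not W4F and does not use W4F's extraction language. The sample page, the `TableRowExtractor` class, the `wrap` function, and the field names are all assumptions made up for this example.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TableRowExtractor(HTMLParser):
    """Extraction stage: collect the text of every <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being parsed, or None
        self._cell = None   # text fragments of the open cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Keep text only while inside a <td>; nested markup such as <b> is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:   # rows without <td> cells (e.g. header rows) are skipped
                self.rows.append(self._row)
            self._row = None


def wrap(html, fields, record_tag="movie", root_tag="movies"):
    """Mapping stage: turn each extracted row into an XML record."""
    parser = TableRowExtractor()
    parser.feed(html)
    parser.close()
    root = ET.Element(root_tag)
    for row in parser.rows:
        record = ET.SubElement(root, record_tag)
        for name, value in zip(fields, row):
            ET.SubElement(record, name).text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical input page, invented for this sketch.
SAMPLE_PAGE = """
<html><body><table>
<tr><th>Title</th><th>Year</th></tr>
<tr><td><b>Casablanca</b></td><td>1942</td></tr>
<tr><td>Vertigo</td><td>1958</td></tr>
</table></body></html>
"""

if __name__ == "__main__":
    print(wrap(SAMPLE_PAGE, ["title", "year"]))
```

The sketch hard-codes its extraction rule in the parser class. A declarative toolkit of the kind the abstract describes would instead let the wrapper author state such rules and the target mapping separately from the parsing code.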
