Looking at the Web through XML glasses

The Web so far has been incredibly successful at delivering information to human users. So successful, in fact, that there is now an urgent need to go beyond the browsing human and make information accessible to applications, in order to offer automation, interoperation and Web-awareness among services. To do so, information from Web sources needs to be accessible in a structured way. XML and its various extensions (data models, query languages) are a step in this direction. Unfortunately, the Web is not yet a well-organized repository of nicely structured XML documents but rather a conglomerate of volatile HTML pages, from which structure has to be extracted. To address this problem, we present the World Wide Web Wrapper Factory (W4F), a Java toolkit for the generation of wrappers for Web sources. Our main contributions are: (1) an expressive language to specify the extraction of complex structures from HTML pages; (2) a declarative mapping to XML documents, with the automatic generation of the corresponding DTDs; (3) visual support to make the engineering of wrappers faster and easier. As an illustration, we show how, with W4F acting as an intermediary, HTML sources can be queried transparently from an XML query language.
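To make the idea of a wrapper concrete, the following is a minimal sketch, written in Python with only the standard library, of the two steps the abstract names: extracting structure from an HTML page and mapping it to an XML document. It is not W4F and does not use W4F's extraction language or its Java API; the class `TableRowExtractor`, the function `wrap`, the sample page and the tag names `movie`, `title` and `year` are all hypothetical and chosen for illustration only.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TableRowExtractor(HTMLParser):
    """Hypothetical extractor: collects <td> cell text, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # one list of cell strings per <tr>
        self._in_cell = False   # True while inside a <td> element

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def wrap(html, record_tag, field_tags):
    """Extract table rows from `html` and map each one to an XML record."""
    parser = TableRowExtractor()
    parser.feed(html)
    parser.close()

    root = ET.Element(record_tag + "s")
    for cells in parser.rows:
        if len(cells) != len(field_tags):
            continue  # skips header rows (no <td>) and malformed rows
        record = ET.SubElement(root, record_tag)
        for name, value in zip(field_tags, cells):
            ET.SubElement(record, name).text = value.strip()
    return ET.tostring(root, encoding="unicode")


# Assumed sample page; a real wrapper would first retrieve it from the Web.
HTML_PAGE = """
<html><body><table>
<tr><th>Title</th><th>Year</th></tr>
<tr><td>Casablanca</td><td>1942</td></tr>
<tr><td>Vertigo</td><td>1958</td></tr>
</table></body></html>
"""

xml_text = wrap(HTML_PAGE, "movie", ["title", "year"])
print(xml_text)
```

Once the page is available as XML, any XML-aware tool can query it without knowing about the original HTML layout. The sketch hard-codes the extraction rule in Python, whereas the abstract describes a declarative specification of both the extraction and the XML mapping.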
