WysiWyg Web Wrapper Factory (W4F)

In this paper, we present the W4F toolkit for generating wrappers for Web sources. W4F consists of a retrieval language to identify Web sources, a declarative extraction language (the HTML Extraction Language) to express robust extraction rules, and a mapping interface to export the extracted information into user-defined data structures. To make wrapper creation rapid and easy, the toolkit offers WYSIWYG support through wizards. Together, these components permit the fast, semi-automatic generation of ready-to-go wrappers provided as Java classes. W4F has been successfully used to generate wrappers for database systems and software agents, making the content of Web sources easily accessible to any kind of application.
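The three layers named above (retrieval, extraction, mapping) can be illustrated with a minimal Python sketch. This is not W4F's API and does not use the HTML Extraction Language; it only shows the shape of such a pipeline using the standard library. The names `retrieve`, `extract`, `map_rows`, `Movie`, and `wrapper`, as well as the sample page, are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


def retrieve() -> str:
    """Retrieval layer (stand-in): a real wrapper would fetch the page over HTTP."""
    # Hypothetical sample page, hard-coded so the sketch is self-contained.
    return (
        "<html><body><table>"
        "<tr><td>Casablanca</td><td>1942</td></tr>"
        "<tr><td>Vertigo</td><td>1958</td></tr>"
        "</table></body></html>"
    )


class CellExtractor(HTMLParser):
    """Extraction layer: collect the text of every <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])          # start a new row
        elif tag == "td":
            self._in_cell = True
            self.rows[-1].append("")      # start a new cell in the current row

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data     # accumulate text inside the cell


def extract(html: str) -> list:
    parser = CellExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


@dataclass
class Movie:
    """Mapping target: a user-defined structure (hypothetical)."""
    title: str
    year: int


def map_rows(rows: list) -> list:
    """Mapping layer: turn raw string rows into typed records."""
    return [Movie(title=r[0].strip(), year=int(r[1])) for r in rows]


def wrapper() -> list:
    """Compose the three layers into one callable wrapper."""
    return map_rows(extract(retrieve()))
```

Calling `wrapper()` returns a list of `Movie` records built from the sample page. Keeping the layers as separate functions mirrors the separation described in the abstract: the extraction rules can change without touching how the page is fetched or how results are exported.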
