A tool-supported method to extract data and schema from Web sites

This paper presents a tool-supported method to reengineer Web sites, that is, to extract the page contents as XML documents structured by expressive DTDs or XML Schemas. All the pages that are recognized as expressing the same application (sub)domain are analyzed in order to derive their common structure. This structure is formalized by an XML document, called META, which is then used to extract an XML document that contains the data of the pages and an XML Schema that validates these data. The META document can describe various structures, such as alternative layouts and data structures for the same concept, structure multiplicity, and the separation between layout and informational content. The XML Schemas extracted from different page types are integrated and conceptualized into a single schema describing the domain covered by the whole Web site. Finally, this conceptual schema is used to build the database of a renovated Web site. These principles are illustrated through a case study using the tools that create the META document and extract the data and the XML Schema.
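The extraction step described above can be pictured with a small sketch. It is not the paper's META format or tooling, whose details the abstract does not give: the `META` dictionary, its `path` and `multiple` keys, the `extract` function, and the sample page are all hypothetical, and the page is assumed to be well-formed XHTML. The sketch only illustrates the idea of a page-type description that locates concepts in the layout, records their multiplicity, and drives the production of a layout-free XML document.

```python
import xml.etree.ElementTree as ET

# Hypothetical META-like description of one page type (an assumption, not the
# paper's format): each field names a concept, the path that locates it in the
# page layout, and whether the concept may occur several times.
META = {
    "root": "Department",
    "fields": [
        {"name": "Name", "path": ".//h1", "multiple": False},
        {"name": "Member", "path": ".//ul[@class='staff']/li", "multiple": True},
    ],
}


def extract(page_xhtml: str, meta: dict) -> ET.Element:
    """Build a data-only XML element from a well-formed XHTML page."""
    page = ET.fromstring(page_xhtml)
    record = ET.Element(meta["root"])
    for field in meta["fields"]:
        matches = page.findall(field["path"])
        if not field["multiple"]:
            # Single-valued concept: keep at most the first occurrence.
            matches = matches[:1]
        for node in matches:
            child = ET.SubElement(record, field["name"])
            # Drop layout markup (e.g. <b>) and keep only the text content.
            child.text = "".join(node.itertext()).strip()
    return record


# Made-up sample page of the hypothetical "Department" page type.
PAGE = """<html><body>
<h1>Computer Science</h1>
<ul class="staff"><li>A. Smith</li><li><b>B. Jones</b></li></ul>
</body></html>"""

record = extract(PAGE, META)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

In this sketch the same description could also drive schema generation: a field marked `multiple` would map to an element with unbounded occurrence, while the others would map to a single occurrence. That mapping is again an assumption made for illustration.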
