Representation of Web Data in a Web Warehouse

We believe that managing Web data effectively requires a data warehouse of Web data, i.e., a Web warehouse. In this paper, we focus on how to represent and store relevant hyperlinked Web documents in a Web warehouse called WHOWEDA (WareHouse Of WEb DAta) so that they can be queried and manipulated further. We present a simple and general model for representing the metadata, structure and content of Web documents and hyperlinks in WHOWEDA. Web documents are represented by node objects and hyperlinks by link objects. Both are first-class objects in our data model, WHOM (WareHouse Object Model), which is designed to represent and manipulate Web data in the warehouse. An important feature of the model is that it represents metadata, content and structure as trees: node and link metadata trees hold the metadata, and node and link data trees hold the content and structure.
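
To make the node and link objects concrete, the following is a minimal sketch of how a Web document and one of its hyperlinks might each be held as a pair of trees (a metadata tree and a data tree). It is an illustration only, not the WHOM definition: the class names `Tree`, `NodeObject` and `LinkObject`, the tree labels (`url`, `size`, `source_url`, `target_url`, `anchor_text`) and the example URLs are all assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Tree:
    """A labeled tree; any vertex may carry a text value (hypothetical helper)."""
    label: str
    value: Optional[str] = None
    children: List["Tree"] = field(default_factory=list)

    def add(self, label: str, value: Optional[str] = None) -> "Tree":
        """Append a child vertex and return it so calls can be nested."""
        child = Tree(label, value)
        self.children.append(child)
        return child

    def find(self, label: str) -> List[str]:
        """Collect, in preorder, the values of all vertices with this label."""
        found = []
        if self.label == label and self.value is not None:
            found.append(self.value)
        for child in self.children:
            found.extend(child.find(label))
        return found


@dataclass
class NodeObject:
    """Assumed shape of a node object: one Web document."""
    metadata: Tree  # node metadata tree (e.g. URL, size)
    data: Tree      # node data tree (content and structure)


@dataclass
class LinkObject:
    """Assumed shape of a link object: one hyperlink."""
    metadata: Tree  # link metadata tree (e.g. source and target URL)
    data: Tree      # link data tree (e.g. anchor text)


def build_example():
    """Build one illustrative document and one hyperlink it contains."""
    node_meta = Tree("node")
    node_meta.add("url", "http://example.org/index.html")
    node_meta.add("size", "2048")
    node_data = Tree("html")
    body = node_data.add("body")
    body.add("h1", "Welcome")
    body.add("p", "See the product list.")
    page = NodeObject(node_meta, node_data)

    link_meta = Tree("link")
    link_meta.add("source_url", "http://example.org/index.html")
    link_meta.add("target_url", "http://example.org/products.html")
    link_data = Tree("a")
    link_data.add("anchor_text", "product list")
    link = LinkObject(link_meta, link_data)
    return page, link
```

Calling `build_example()` returns a `(page, link)` pair; `page.metadata.find("url")` and `link.data.find("anchor_text")` then look up values by label in the respective trees. Keeping metadata and data in separate trees mirrors the split described in the abstract, so a query could constrain either one independently.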
