WebDB: A System for Querying Semi-structured Data on the Web

Abstract: The World-Wide Web can be viewed as a collection of semi-structured multimedia documents, in the form of Web pages, connected through hyperlinks. Unlike most Web search engines, which focus primarily on information retrieval, WebDB aims to support comprehensive database-like query functionality, including selection, aggregation, sorting, summary, grouping, and projection. WebDB allows users to access (1) document-level information, such as title, URL, length, keywords, types, and last-modified date; (2) intra-document structures, such as tables, forms, and images; and (3) inter-document linkage information, such as destination URLs and anchors. With these three types of information, WebDB can answer comprehensive queries for complex Web-based applications, such as Web mining and Web site management. WebDB is based on object-relational concepts: object-oriented modeling and a relational query language. In this paper, we present the data model, language, and implementation of WebDB. We also present a novel visual query and browsing interface for the semi-structured Web and Web documents. Compared with other existing systems, WebDB provides high usability.
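To make the three kinds of information concrete, the following is a minimal sketch in Python using the standard `sqlite3` module. It is not WebDB's data model or query language, which this abstract does not specify. The table names (`document`, `image`, `link`), the column names, the sample rows, and the function names are all assumptions chosen for illustration. The query combines selection (a length filter), projection, grouping, aggregation (a count of outgoing links), and sorting.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database (hypothetical schema, not WebDB's)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- (1) document-level information
        CREATE TABLE document (url TEXT PRIMARY KEY, title TEXT, length INTEGER);
        -- (2) intra-document structure (images only, for brevity)
        CREATE TABLE image (doc_url TEXT, src TEXT);
        -- (3) inter-document linkage
        CREATE TABLE link (src_url TEXT, dest_url TEXT, anchor TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO document VALUES (?, ?, ?)",
        [
            ("http://example.org/", "Home", 1200),
            ("http://example.org/a", "About", 300),
            ("http://example.org/p", "Products", 2500),
        ],
    )
    conn.executemany(
        "INSERT INTO image VALUES (?, ?)",
        [("http://example.org/p", "catalog.png")],
    )
    conn.executemany(
        "INSERT INTO link VALUES (?, ?, ?)",
        [
            ("http://example.org/", "http://example.org/a", "About us"),
            ("http://example.org/", "http://example.org/p", "Products"),
            ("http://example.org/p", "http://example.org/", "Home"),
        ],
    )
    return conn


def summarize_links(conn, min_length):
    """Return (title, outgoing link count) for documents of at least min_length."""
    return conn.execute(
        """
        SELECT d.title, COUNT(l.dest_url) AS n_links      -- projection, aggregation
        FROM document d
        LEFT JOIN link l ON l.src_url = d.url
        WHERE d.length >= ?                               -- selection
        GROUP BY d.url, d.title                           -- grouping
        ORDER BY n_links DESC, d.title                    -- sorting
        """,
        (min_length,),
    ).fetchall()


def titles_with_images(conn):
    """Return titles of documents that contain at least one image, sorted by title."""
    rows = conn.execute(
        """
        SELECT DISTINCT d.title
        FROM document d JOIN image i ON i.doc_url = d.url
        ORDER BY d.title
        """
    ).fetchall()
    return [title for (title,) in rows]


if __name__ == "__main__":
    conn = build_demo_db()
    print(summarize_links(conn, 1000))
    print(titles_with_images(conn))
```

A plain relational schema is used here only because it is easy to run. An object-relational system such as the one described above would presumably model documents, structures, and links as objects rather than flat rows.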
