Information gathering in the World-Wide Web: the W3QL query language and the W3QS system

The World Wide Web (WWW) is a fast growing global information resource. It contains an enormous amount of information and provides access to a variety of services. Since there is no central control and very few standards of information organization or service offering, searching for information and services is a widely recognized problem. To some degree this problem is solved by “search services,” also known as “indexers,” such as Lycos, AltaVista, Yahoo, and others. These sites employ search engines known as “robots” or “knowbots” that scan the network periodically and form text-based indices. These services are limited in certain important aspects. First, the structural information, namely, the organization of the document into parts pointing to each other, is usually lost. Second, one is limited by the kind of textual analysis provided by the “search service.” Third, search services are incapable of navigating “through” forms. Finally, one cannot prescribe a complex database-like search. We view the WWW as a huge database. We have designed a high-level SQL-like language called W3QL to support effective and flexible query processing, which addresses the structure and content of WWW nodes and their varied sorts of data. We have implemented a system called W3QS to execute W3QL queries. In W3QS, query results are declaratively specified and continuously maintained as views when desired. The current architecture of W3QS provides a server that enables users to pose queries as well as integrate their own data analysis tools. The system and its query language set a framework for the development of database-like tools over the WWW. A significant contribution of this article is in formalizing the WWW and query processing over it.