With the current explosion of information on the World Wide Web (WWW), a wealth of information on many different subjects has become available online. Many sources contain information that can be classified as semi-structured. At present, however, the only way to access this information is by browsing individual pages; web documents cannot be queried in a database-like fashion based on their underlying structure. We can nevertheless provide database-like querying for semi-structured WWW sources by building wrappers around them. We present an approach for semi-automatically generating such wrappers. The key idea is to exploit the formatting information in pages from a source to hypothesize the underlying structure of a page. From this structure, the system generates a wrapper that facilitates querying the source and possibly integrating it with other sources. We demonstrate the ease with which we can build wrappers for a number of Internet sources in different domains using our implemented wrapper generation toolkit.
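As a rough illustration of the key idea, the sketch below treats HTML heading tags as formatting cues, hypothesizes one section per heading, and returns a small query function over the result. It is not the toolkit described in the abstract: the heading-only heuristic and the names `SectionGuesser` and `build_wrapper` are assumptions made for this example.

```python
from html.parser import HTMLParser

# Assumption: headings are the only formatting cue used in this sketch.
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class SectionGuesser(HTMLParser):
    """Hypothesize page sections from heading tags (illustrative heuristic)."""

    def __init__(self):
        super().__init__()
        self.sections = {}        # heading text -> list of body text fragments
        self._in_heading = False
        self._heading_parts = []
        self._current = None      # heading that is currently collecting text

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._in_heading = True
            self._heading_parts = []

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._in_heading:
            self._in_heading = False
            self._current = " ".join(self._heading_parts).strip()
            self.sections.setdefault(self._current, [])

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_heading:
            self._heading_parts.append(text)
        elif self._current is not None:
            self.sections[self._current].append(text)


def build_wrapper(html):
    """Return a query function over the hypothesized sections of one page."""
    parser = SectionGuesser()
    parser.feed(html)
    parser.close()
    sections = {name: " ".join(parts) for name, parts in parser.sections.items()}

    def query(section_name):
        # Look up the text hypothesized to belong to the named section.
        return sections.get(section_name)

    return query
```

A call such as `build_wrapper(page_html)("Economy")` would return the text found under an "Economy" heading, or `None` if no such heading was seen. A semi-automatic system would additionally let a user correct the hypothesized structure before the wrapper is generated.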