Paper information - Wrapper generation for semi-structured Internet sources

Wrapper generation for semi-structured Internet sources

With the current explosion of information on the World Wide Web (WWW), a wealth of information on many different subjects has become available online. Numerous sources contain information that can be classified as semi-structured. At present, however, the only way to access this information is by browsing individual pages; we cannot query web documents in a database-like fashion based on their underlying structure. We can, however, provide database-like querying for semi-structured WWW sources by building wrappers around these sources. We present an approach for semi-automatically generating such wrappers. The key idea is to exploit the formatting information in pages from the source to hypothesize the underlying structure of a page. From this structure the system generates a wrapper that facilitates querying of a source and possibly integrating it with other sources. We demonstrate the ease with which we are able to build wrappers for a number of Internet sources in different domains using our implemented wrapper generation toolkit.
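The key idea in the abstract, using formatting cues to guess a page's structure and then deriving a wrapper from that guess, can be illustrated with a minimal Python sketch. This is not the authors' toolkit: the choice of heading and bold tags as section labels, the names `StructureGuesser` and `build_wrapper`, and the sample page are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Assumed heuristic (not from the paper's toolkit): heading and bold
# text mark the start of a new labelled section.
LABEL_TAGS = {"h1", "h2", "h3", "b", "strong"}


class StructureGuesser(HTMLParser):
    """Splits a page into [label, text] sections using formatting tags as cues."""

    def __init__(self):
        super().__init__()
        self.sections = []      # [label, text] pairs in page order
        self._in_label = False  # True while inside a label tag

    def handle_starttag(self, tag, attrs):
        if tag in LABEL_TAGS:
            self._in_label = True
            self.sections.append(["", ""])  # open a new section

    def handle_endtag(self, tag):
        if tag in LABEL_TAGS:
            self._in_label = False

    def handle_data(self, data):
        if not self.sections:
            return  # text before the first label belongs to no section
        index = 0 if self._in_label else 1
        self.sections[-1][index] += data


def build_wrapper(label_of_interest):
    """Returns a function that extracts one labelled section from a page."""
    def wrapper(html):
        guesser = StructureGuesser()
        guesser.feed(html)
        guesser.close()  # flush any text still buffered by the parser
        for label, text in guesser.sections:
            if label.strip().rstrip(":").lower() == label_of_interest.lower():
                return text.strip()
        return None  # the page has no section with that label
    return wrapper


# Hypothetical sample page in the style of a semi-structured source.
page = "<h2>Country</h2><p>France</p><b>Capital:</b> Paris<br><b>Currency:</b> Euro"
get_capital = build_wrapper("capital")
print(get_capital(page))
```

Once built, a wrapper such as `get_capital` can be applied to any page from the same source that follows the same formatting, which is what makes field-level, database-like queries possible. A real source would need richer cues (font size, indentation, nested headings) and a way for a user to correct wrong guesses, in line with the semi-automatic approach the abstract describes.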

Craig A. Knoblock | Naveen Ashish

[1] Nicholas Kushmerick, et al. Wrapper Induction for Information Extraction, 1997, IJCAI.

[2] Stephen C. Johnson. Yacc: Yet Another Compiler-Compiler, 1978.

[3] Hector Garcia-Molina, et al. Template-based wrappers in the TSIMMIS system, 1997, SIGMOD '97.

[4] Serge Abiteboul, et al. From structured documents to novel query facilities, 1994, SIGMOD '94.

[5] Dan Suciu, et al. A query language and optimization techniques for unstructured data, 1996, SIGMOD '96.

[6] E. Schmidt, et al. Lex—a lexical analyzer generator, 1990.

[7] David Konopnicki, et al. W3QS: A Query System for the World-Wide Web, 1995, VLDB.

[8] Alberto O. Mendelzon, et al. Querying the World Wide Web, 1996, Fourth International Conference on Parallel and Distributed Information Systems.

[9] Timothy W. Finin, et al. KQML as an agent communication language, 1994, CIKM '94.