Semi-automatic wrapper generation for Internet information sources

To simplify the task of obtaining information from the vast number of sources available on the World Wide Web (WWW), the authors are building information mediators that extract and integrate data from multiple Web sources. In a mediator-based approach, a wrapper is built around each information source to translate between the mediator's query language and that source. The authors present an approach for semi-automatically generating wrappers for structured Internet sources. The key idea is to exploit the formatting information in Web pages to hypothesize the underlying structure of a page. From this structure, the system generates a wrapper that supports querying the source and, where possible, integrating it with other sources. Using their implemented wrapper-generation toolkit, they demonstrate how easily wrappers can be built for a number of Web sources.
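The key idea, using formatting cues to hypothesize page structure, can be illustrated with a minimal sketch. This is not the authors' toolkit: the choice of heading and bold tags as section cues, and the names `HEADING_TAGS`, `StructureGuesser`, and `guess_structure`, are assumptions made purely for illustration.

```python
from html.parser import HTMLParser

# Tags treated as section-heading cues (an illustrative assumption,
# not the heuristics used in the paper).
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "b", "strong"}


class StructureGuesser(HTMLParser):
    """Collects (heading, body text) pairs based on formatting tags."""

    def __init__(self):
        super().__init__()
        self.sections = []        # list of [heading, body_text]
        self._in_heading = False  # True while inside a heading-like tag
        self._buf = []            # text fragments of the current heading

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._in_heading = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._in_heading:
            self._in_heading = False
            heading = " ".join("".join(self._buf).split())
            if heading:
                self.sections.append([heading, ""])

    def handle_data(self, data):
        if self._in_heading:
            self._buf.append(data)
        elif self.sections:
            # Text outside a heading is attributed to the latest heading.
            text = " ".join(data.split())
            if text:
                body = self.sections[-1][1]
                self.sections[-1][1] = (body + " " + text).strip()


def guess_structure(html):
    """Return a hypothesized {section name: text} mapping for a page."""
    parser = StructureGuesser()
    parser.feed(html)
    parser.close()
    return {heading.rstrip(":"): body for heading, body in parser.sections}


page = ("<html><body><h2>Geography</h2><p>Area: 100 sq km</p>"
        "<b>Capital:</b> Springfield<br></body></html>")
structure = guess_structure(page)
```

A hypothesized mapping of this kind is only a starting point. In a semi-automatic setting, a person would be expected to review and correct the guessed sections before a wrapper is generated from them.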
