Wrapper generation for Web accessible data sources

A growing number of data sources can be queried across the WWW. Such sources typically support HTML forms-based interfaces, and search engines query collections of suitably indexed data. The data is displayed via a browser. These sources have three drawbacks. First, there is no standard programming interface suitable for applications to submit queries. Second, the output (the answer to a query) is not well structured: structured objects have to be extracted from HTML documents, which contain irrelevant data and may be volatile. Third, domain knowledge about the data source is also embedded in HTML documents and must be extracted. To solve these problems, we present technology to define and (automatically) generate wrappers for Web accessible sources. Our contributions are as follows: (1) defining a wrapper interface to specify the capability of Web accessible data sources; (2) developing a wrapper generation toolkit of graphical interfaces and specification languages to specify the capability of sources and the functionality of the wrapper; and (3) developing the technology to automatically generate a wrapper appropriate to the Web accessible source from the specifications.
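
To make the three problems concrete, the following is a minimal Python sketch of what a hand-written wrapper for a forms-based source might look like. It is not the interface or toolkit described in the paper. The names `Capability`, `RowExtractor`, `Wrapper`, `build_url`, and `extract` are hypothetical, and the sketch assumes that the source accepts its query as URL parameters and returns answers as rows of an HTML table.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from urllib.parse import urlencode


@dataclass(frozen=True)
class Capability:
    """Hypothetical capability description of one Web source."""
    base_url: str              # address the HTML form submits to (assumed)
    query_attributes: tuple    # attributes the form accepts as input
    result_attributes: tuple   # attributes returned, in column order


class RowExtractor(HTMLParser):
    """Collects the text of <td> cells, one list per <tr>; other markup is ignored."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            if self._row:  # skip rows without <td> cells, e.g. header rows
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell and self._row:
            self._row[-1] += data.strip()


class Wrapper:
    """Gives applications a programmatic interface to one source."""

    def __init__(self, capability):
        self.capability = capability

    def build_url(self, **bindings):
        # Reject queries the source cannot answer, per its declared capability.
        unsupported = set(bindings) - set(self.capability.query_attributes)
        if unsupported:
            raise ValueError(f"source cannot be queried on: {sorted(unsupported)}")
        return f"{self.capability.base_url}?{urlencode(sorted(bindings.items()))}"

    def extract(self, html):
        # Turn table rows into structured objects, dropping irrelevant markup.
        parser = RowExtractor()
        parser.feed(html)
        names = self.capability.result_attributes
        return [dict(zip(names, row)) for row in parser.rows if len(row) == len(names)]
```

In this sketch, an application would call `build_url` to form a query, fetch the resulting page by any HTTP client, and pass the page to `extract`. Generating such a class from a declarative specification, rather than writing it by hand, is the step the abstract describes as automatic wrapper generation.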
