On the Automatic Extraction of Data from the Hidden Web

An increasing amount of Web data is accessible only by filling out HTML forms to query an underlying data source. While this is most welcome from a user perspective (queries are easy and precise) and from a data management perspective (static pages need not be maintained; databases can be accessed directly), automated agents have greater difficulty accessing data behind forms. In this paper we present a method for automatically filling in forms to retrieve the associated dynamically generated pages. Using our approach, automated agents can begin to systematically access portions of the “hidden Web.”
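To make the idea of automatic form filling concrete, the following is a minimal sketch and not the method of this paper. It assumes a form submitted via GET whose fields offer a finite set of choices (`select` menus and radio buttons), and it simply enumerates every combination of those choices as a query URL. The names `FormFieldParser` and `enumerate_queries` are hypothetical, introduced only for illustration; free-text fields are ignored.

```python
from html.parser import HTMLParser
from itertools import product
from urllib.parse import urlencode, urljoin


class FormFieldParser(HTMLParser):
    """Hypothetical helper: collects the action URL and the finite-choice
    fields (select menus, radio buttons) of the first form on a page."""

    def __init__(self):
        super().__init__()
        self.action = ""
        self.fields = {}        # field name -> list of candidate values
        self._in_form = False   # True while inside the first <form>
        self._seen_form = False
        self._select = None     # name of the <select> currently open

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "form":
            if not self._seen_form:
                self._seen_form = True
                self._in_form = True
                self.action = a.get("action") or ""
            return
        if not self._in_form:
            return
        if tag == "select" and a.get("name"):
            self._select = a["name"]
            self.fields.setdefault(self._select, [])
        elif tag == "option" and self._select is not None:
            if a.get("value") is not None:
                self.fields[self._select].append(a["value"])
        elif tag == "input" and a.get("type") == "radio" and a.get("name"):
            if a.get("value") is not None:
                self.fields.setdefault(a["name"], []).append(a["value"])

    def handle_endtag(self, tag):
        if tag == "select":
            self._select = None
        elif tag == "form":
            self._in_form = False


def enumerate_queries(html, base_url):
    """Return one GET URL per combination of the form's finite choices.

    Assumption: exhaustive enumeration is acceptable; a real agent would
    likely need to sample or prune this space.
    """
    parser = FormFieldParser()
    parser.feed(html)
    names = sorted(parser.fields)
    target = urljoin(base_url, parser.action)
    urls = []
    for combo in product(*(parser.fields[n] for n in names)):
        urls.append(target + "?" + urlencode(list(zip(names, combo))))
    return urls


if __name__ == "__main__":
    page = (
        '<form action="/search" method="get">'
        '<select name="make">'
        '<option value="ford">Ford</option>'
        '<option value="honda">Honda</option>'
        "</select>"
        '<input type="radio" name="cond" value="new">'
        '<input type="radio" name="cond" value="used">'
        "</form>"
    )
    for url in enumerate_queries(page, "http://example.com/cars/index.html"):
        print(url)
```

The number of combinations grows multiplicatively with the number of fields, so exhaustive enumeration of this kind is only practical for small forms.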
