Extracting Data behind Web Forms

A significant and ever-increasing amount of data is accessible only by filling out HTML forms to query an underlying Web data source. While this is welcome from a user perspective (queries are relatively easy and precise) and from a data management perspective (static pages need not be maintained and databases can be accessed directly), automated agents face the challenge of obtaining the data behind forms. In principle, an agent can obtain all the data behind a form by submitting the form repeatedly, filled out in every possible way, but efficiency concerns lead us to consider alternatives. We investigate these alternatives and show that, after a small number of submissions, we can estimate how much data (if any) remains to be retrieved, and that we can heuristically select a reasonably small number of submissions that maximizes coverage of the data. Experimental results show that these statistical predictions are appropriate and useful.
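To make the exhaustive baseline concrete, the following minimal sketch enumerates every way of filling out a small form and records how many previously unseen records each submission returns. It is an illustration under stated assumptions, not the method evaluated in the paper: the form fields, the toy record set, and the helper names (`count_submissions`, `all_submissions`, `harvest`, `toy_submit`) are all hypothetical, and an empty string is assumed to mean "any value" for a field.

```python
from itertools import product
from math import prod


def count_submissions(fields):
    """Number of distinct ways to fill out the form (product of choice counts)."""
    return prod(len(choices) for choices in fields.values())


def all_submissions(fields):
    """Yield every filled-out form as a dict mapping field name to chosen value."""
    names = list(fields)
    for values in product(*(fields[name] for name in names)):
        yield dict(zip(names, values))


def harvest(fields, submit):
    """Submit the form in all possible ways.

    Returns the set of distinct records retrieved and, for each submission
    in order, the number of records that had not been seen before.
    """
    seen = set()
    new_per_submission = []
    for filled in all_submissions(fields):
        records = set(submit(filled))
        new_per_submission.append(len(records - seen))
        seen |= records
    return seen, new_per_submission


# Hypothetical toy form: "" is assumed to mean "any value" for that field.
FIELDS = {"make": ["", "ford", "honda"], "year": ["", "2000", "2001"]}

# Hypothetical stand-in for the database behind the form.
CARS = [("ford", "2000"), ("ford", "2001"), ("honda", "2001")]


def toy_submit(filled):
    """Simulate a form submission against the in-memory record list."""
    return [
        car for car in CARS
        if filled["make"] in ("", car[0]) and filled["year"] in ("", car[1])
    ]


if __name__ == "__main__":
    records, new_counts = harvest(FIELDS, toy_submit)
    print(count_submissions(FIELDS), len(records), new_counts)
```

In this toy setting the number of submissions is the product of the per-field choice counts, so it grows quickly as fields are added, while most submissions after the first few return nothing new. That redundancy is what motivates estimating the remaining data and choosing submissions selectively rather than exhaustively.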
