Deep Web Navigation by Example

Large portions of the Web are buried behind user-oriented interfaces and can only be accessed by filling out forms. A major hurdle in making this information accessible to automatic processing is navigating to the actual result page. In this paper we present a framework for navigating these so-called Deep Web sites based on the page-keyword-action paradigm: the system fills out a form with the provided input parameters and submits it. It then checks whether it has reached a result page by looking for pre-specified keyword patterns in the current page. Depending on the outcome, it either executes further actions to reach a result page or returns the resulting URL.
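
The page-keyword-action loop described above can be illustrated with a minimal sketch. It is not the framework's implementation: the in-memory `SITE`, the helpers `load` and `submit_form`, the `Rule` type, and the example patterns are all hypothetical stand-ins for a real browser session and a real site, chosen so that the example runs without network access.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Page:
    url: str
    text: str


@dataclass
class Rule:
    pattern: str                              # keyword pattern searched for in the page text
    action: Optional[Callable[[Page], Page]]  # None marks the page as a result page


# Hypothetical three-page site standing in for a real Deep Web source.
SITE: Dict[str, str] = {
    "/search": "Search flights. From: [ ] To: [ ]",
    "/choose": "Did you mean Frankfurt (FRA)? Continue",
    "/results": "Results: 3 flights found",
}


def load(url: str) -> Page:
    """Fetch a page from the simulated site."""
    return Page(url, SITE[url])


def submit_form(page: Page, params: Dict[str, str]) -> Page:
    """Simulated form submission: an ambiguous city name leads to an
    intermediate page, anything else leads directly to the results."""
    if params.get("from") == "Frankfurt":
        return load("/choose")
    return load("/results")


def navigate(start_url: str, params: Dict[str, str],
             rules: List[Rule], max_steps: int = 5) -> Optional[str]:
    """Fill out and submit the form, then apply keyword rules until a
    result page is reached. Returns its URL, or None on failure."""
    page = submit_form(load(start_url), params)
    for _ in range(max_steps):
        # Pick the first rule whose keyword pattern occurs in the current page.
        rule = next((r for r in rules if re.search(r.pattern, page.text)), None)
        if rule is None:
            return None          # unknown page: no keyword pattern matched
        if rule.action is None:
            return page.url      # result page found
        page = rule.action(page) # intermediate page: run the action and re-check
    return None                  # step budget exhausted


RULES = [
    Rule(r"Results: \d+ flights", None),
    Rule(r"Did you mean", lambda page: load("/results")),
]

print(navigate("/search", {"from": "Frankfurt", "to": "Oslo"}, RULES))
```

In this sketch the submission first lands on the intermediate "Did you mean" page, the matching rule's action follows the continue link, and the next check recognizes the result page. Bounding the loop with `max_steps` is a design choice of the sketch to avoid cycling between intermediate pages.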
