Integrating Deep-Web Information Sources

Deep-web information sources are difficult to integrate into automated business processes if they only provide a search form. A wrapping agent is a piece of software that allows a developer to query such information sources without worrying about the details of interacting with such forms. Our goal is to help software engineers construct wrapping agents that interpret queries written in high-level structured languages. We expect this to reduce integration costs because it relieves developers of the burden of transforming their queries into low-level interactions in an ad-hoc manner. In this paper, we report on our reference framework, delve into the related work, and highlight current research challenges. This is intended to help guide future research efforts in this area.
