Adapting Searchy to extract data using evolved wrappers

Highlights

- A variable-length Genetic Algorithm can automatically learn regular expressions from a set of positive and negative examples.
- We propose an algorithm based on Zipf's law to build an alphabet of tokens.
- Genetic algorithms can be introduced into the Searchy agent platform, in a data extraction environment, as evolutionary wrappers.

Abstract

Organizations need diverse information systems to cope with growing requirements for information storage and processing. This creates information islands and makes it intrinsically difficult to obtain a global view. Providing a unified view of the (likely heterogeneous) information available in an organization adds value to its information systems and has been the subject of intense research.

In this paper we present an extension of Searchy, an agent-based mediator system specialized in data extraction and integration. Through a set of wrappers, Searchy integrates information from arbitrary sources and semantically translates it according to a mediated schema. Searchy is in fact a domain-independent wrapper container that eases wrapper development by providing, for example, semantic mapping.

The extension proposed in this paper introduces an evolutionary wrapper that is able to evolve wrappers based on regular expressions. To achieve this, a Genetic Algorithm (GA) is used to learn a regex that extracts a set of positive samples while rejecting a set of negative samples.
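The idea of scoring a candidate regex against positive and negative samples can be illustrated with a minimal Python sketch. It is not the fitness function used in the paper: the function name `fitness`, the scoring formula (fraction of positives matched minus fraction of negatives matched), the penalty for invalid patterns, and the sample data are all assumptions made for illustration. A GA would apply such a score to each individual in its population; here it only ranks a fixed list of candidates.

```python
import re

def fitness(pattern, positives, negatives):
    """Hypothetical score: reward matched positives, penalise matched negatives."""
    try:
        compiled = re.compile(pattern)
    except re.error:
        # Evolved candidates may be syntactically invalid; give them the worst score.
        return -1.0
    hits = sum(1 for s in positives if compiled.fullmatch(s))
    false_hits = sum(1 for s in negatives if compiled.fullmatch(s))
    return hits / len(positives) - false_hits / len(negatives)

# Illustrative samples (assumed, not taken from the paper).
positives = ["555-1234", "555-9876"]
negatives = ["5551234", "abc-defg"]

# A fixed set of candidates standing in for a GA population.
candidates = [r"\d+", r"\d{3}-\d{4}", r".*", r"(\d"]
best = max(candidates, key=lambda p: fitness(p, positives, negatives))
```

With this scoring, a pattern that accepts everything (such as `.*`) gains nothing over one that accepts nothing, so only candidates that separate the two sample sets are rewarded.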
