VEWRA: An Algorithm for Wrapper Verification

Web wrappers play an important role in extracting information from distributed web sources and, subsequently, in integrating heterogeneous data. Changes in the layout of a web source typically break its wrapper, leading to erroneous extraction of information. Monitoring and repairing broken wrappers is a major hurdle for data integration because it is an expensive and painful procedure. In this paper we present VEWRA, a new approach to wrapper verification that improves on the successful family of trainable content-based methods. Compared to its predecessors, the new method aims to capture not only the syntactic patterns of the extracted information but also the correlations that exist among them due to its underlying semantics. Experiments show that our method achieves excellent performance, always performing at least as well as DATAPROG, the state-of-the-art related work.
