Maintaining Web Navigation Flows for Wrappers

A substantial subset of web data follows some kind of underlying structure. To let software programs benefit fully from these “semi-structured” web sources, wrapper programs are built to provide a “machine-readable” view over them. A significant problem with wrappers is that, because web sources are autonomous, they may undergo changes that invalidate the current wrapper, so automatic wrapper maintenance is an important research issue. Web wrappers must perform two kinds of tasks: automatically navigating through websites and automatically extracting structured data from HTML pages. While several previous works have addressed the automatic maintenance of the components performing the data extraction task, the problem of automatically maintaining the required web navigation sequences remains, to the best of our knowledge, unaddressed. In this paper we propose and experimentally validate a set of novel heuristics and algorithms to fill this gap.
