Crawling programs for wrapper-based applications

Many large web sites provide pages containing highly valuable data. To extract data from these pages, several methods and techniques have been developed to generate web wrappers, that is, programs that convert the data embedded in HTML pages into a structured format. These techniques ease the burden of writing applications that reuse data from the web. However, the generation of wrappers is just one of the ingredients needed for the development of such applications. A necessary yet underestimated task is that of developing programs that drive a crawler towards the pages containing the target data. We present a method and an associated tool to support this activity. Our method relies on a data model whose constructs allow a designer to define an intensional description of the organization of data in a web site. Based on the model, we introduce the concepts of (i) intensional navigation, which represents an abstract description of the navigation to be performed to reach pages of interest, and (ii) extensional navigation, which represents the actual set of navigation paths (i.e., sequences of links to be followed) that lead to the target pages. The method is supported by a tool that infers an intensional navigation, i.e., the crawling program, from one sample extensional navigation. The tool, which has been developed as a Firefox plug-in, supports the designer in the task of defining and verifying the sample navigation and the inferred crawling program.
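
The distinction between the two kinds of navigation can be illustrated with a minimal Python sketch. It is not the tool described above, and everything in it is an assumption made for illustration: the toy `SITE` dictionary, the page-class and link-collection names, and the functions `infer_intensional` and `expand` are all hypothetical. One sample path is generalized into a sequence of (page class, link collection) steps, which is then evaluated to enumerate the concrete paths.

```python
from typing import Dict, List, Tuple

# Hypothetical toy site: URL -> (page class, {link collection name: target URLs}).
SITE: Dict[str, Tuple[str, Dict[str, List[str]]]] = {
    "/": ("Home", {"categories": ["/cat/a", "/cat/b"], "about": ["/about"]}),
    "/cat/a": ("Category", {"items": ["/item/1", "/item/2"]}),
    "/cat/b": ("Category", {"items": ["/item/3"]}),
    "/about": ("About", {}),
    "/item/1": ("Item", {}),
    "/item/2": ("Item", {}),
    "/item/3": ("Item", {}),
}

# One abstract step: on a page of this class, follow this link collection.
Step = Tuple[str, str]


def infer_intensional(sample_path: List[str]) -> List[Step]:
    """Generalize one sample path into (page class, link collection) steps."""
    steps: List[Step] = []
    for src, dst in zip(sample_path, sample_path[1:]):
        page_class, links = SITE[src]
        # Pick the link collection on src that contains the next sample URL.
        name = next(n for n, targets in links.items() if dst in targets)
        steps.append((page_class, name))
    return steps


def expand(start: str, steps: List[Step]) -> List[List[str]]:
    """Evaluate the abstract steps into the concrete paths they describe."""
    paths = [[start]]
    for page_class, name in steps:
        next_paths: List[List[str]] = []
        for path in paths:
            cls, links = SITE[path[-1]]
            if cls != page_class:
                continue  # this page does not match the expected class
            for target in links.get(name, []):
                next_paths.append(path + [target])
        paths = next_paths
    return paths


program = infer_intensional(["/", "/cat/a", "/item/1"])
paths = expand("/", program)
```

In this toy setting, the single sample path through `/cat/a` yields a program that also reaches the items listed under `/cat/b`, which is the kind of generalization the abstract attributes to the inferred crawling program.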
