Accelerating Structured Web Crawling without Losing Data

The trade-off between the size of retrieved data and crawling time is a well-known dilemma in the structured Web crawling community. The real challenge within this dilemma is to optimize the settings of a given wrapper so that it obtains the maximum available data in the shortest possible time. In this paper, we try to tune these settings by introducing a threaded algorithm that guarantees access to all available detail pages within the crawling scope. Using this algorithm, we then try to reduce the time consumed by the crawler through simple adjustments of the sleep time after each detail page visit.
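To make the idea concrete, the following is a minimal Python sketch of a threaded detail-page crawler with a configurable sleep after each visit. It is not the algorithm of this paper, whose details are not given in the abstract. The function `crawl_detail_pages`, its parameters (`num_threads`, `sleep_seconds`, `max_attempts`), and the injected `fetch` callable are all hypothetical names chosen for illustration. The sketch assumes that the list of detail-page URLs is already known and that a failed visit can simply be retried.

```python
import queue
import threading
import time
from typing import Callable, Dict, Iterable, List, Tuple


def crawl_detail_pages(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    num_threads: int = 4,
    sleep_seconds: float = 0.5,
    max_attempts: int = 3,
) -> Tuple[Dict[str, str], List[str]]:
    """Visit every URL with a pool of threads, sleeping after each visit.

    Hypothetical sketch: `fetch` stands in for the wrapper's page download
    and extraction step. Returns (pages, failed), where `failed` lists the
    URLs that still raised an error after `max_attempts` tries.
    """
    pending = queue.Queue()
    for url in urls:
        pending.put((url, 1))  # (url, attempt number)

    pages: Dict[str, str] = {}
    failed: List[str] = []
    lock = threading.Lock()

    def worker() -> None:
        while True:
            try:
                url, attempt = pending.get_nowait()
            except queue.Empty:
                return  # nothing left for this thread to do
            try:
                page = fetch(url)
                with lock:
                    pages[url] = page
            except Exception:
                if attempt < max_attempts:
                    # Re-queue instead of dropping, so no page is lost
                    # because of a transient error.
                    pending.put((url, attempt + 1))
                else:
                    with lock:
                        failed.append(url)
            finally:
                # The tunable pause after each detail page visit.
                time.sleep(sleep_seconds)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return pages, failed
```

Under these assumptions, the trade-off described above can be explored by timing the same crawl with different `sleep_seconds` values and checking after each run that `failed` is empty and that `pages` contains every expected URL.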
