Intelligent crawling of web applications for web archiving

The steady growth of the World Wide Web raises challenges for the preservation of meaningful Web data. Tools currently used by Web archivists crawl blindly and store every page they encounter. They disregard the kind of Web site being accessed, which leads to suboptimal crawling strategies, and they ignore whatever structured content the pages contain, which results in page-level archives whose content is hard to exploit. In this PhD work we focus on the crawling and archiving of publicly accessible Web applications, especially those of the social Web. A Web application is any application that uses Web standards such as HTML and HTTP to publish information on the Web in a form accessible to Web browsers. Examples include Web forums, social networks, and geolocation services. We claim that the best strategy for crawling these applications is to make the Web crawler aware of the kind of application it is currently processing, so that it can refine the list of URLs to process and annotate the archive with information about the structure of the crawled content. We add adaptive characteristics to an archival Web crawler: it identifies when a Web page belongs to a given Web application and then applies the appropriate crawling and content extraction methodology.
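As a rough illustration of this idea, and not the system developed in this work, the following Python sketch shows one way an application-aware step could look: a page is matched against known application profiles, and the matching profile decides which outgoing URLs to keep and which structured fields to record as annotations. Every name here (`AppProfile`, `PROFILES`, `identify`, `process`), the `ExampleForum` generator tag, and all the regular expressions are assumptions made up for the example.

```python
import re
from dataclasses import dataclass


@dataclass
class AppProfile:
    """Hypothetical description of one kind of Web application."""
    name: str
    detect: re.Pattern   # matched against the page HTML to recognise the application
    follow: re.Pattern   # outgoing URLs worth crawling for this application
    extract: dict        # field name -> regex with exactly one capture group


# Illustrative profile only; the markup and URL patterns are invented.
PROFILES = [
    AppProfile(
        name="forum",
        detect=re.compile(r'<meta name="generator" content="ExampleForum', re.I),
        follow=re.compile(r"/(?:thread|board)/\d+"),
        extract={"post_author": re.compile(r'<span class="author">([^<]+)</span>')},
    ),
]


def identify(html):
    """Return the first profile whose detection pattern matches, else None."""
    for profile in PROFILES:
        if profile.detect.search(html):
            return profile
    return None


def process(html, links):
    """Refine the URL list and annotate the page according to its application."""
    profile = identify(html)
    if profile is None:
        # Unknown application: fall back to the generic behaviour (keep all links).
        return {"app": None, "urls": list(links), "annotations": {}}
    urls = [u for u in links if profile.follow.search(u)]
    annotations = {name: rx.findall(html) for name, rx in profile.extract.items()}
    return {"app": profile.name, "urls": urls, "annotations": annotations}
```

Keeping detection, URL filtering, and extraction rules together in one profile means that supporting a new kind of application only requires adding a profile, while pages from unrecognised sites are still crawled generically.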
