Intelligent and Adaptive Crawling of Web Applications for Web Archiving

Web sites are dynamic in nature, with content and structure changing over time. Many pages on the Web are produced by content management systems (CMSs) such as WordPress, vBulletin, or phpBB. Tools currently used by Web archivists to preserve the content of the Web blindly crawl and store Web pages, disregarding the CMS the site is based on (leading to suboptimal crawling strategies) and whatever structured content is contained in Web pages (resulting in page-level archives whose content is hard to exploit). We present in this paper an application-aware helper (AAH) that fits into an archiving crawl processing chain to perform intelligent and adaptive crawling of Web applications (e.g., the pages served by a CMS). Because the AAH is aware of the Web application currently being crawled, it is able to refine the list of URLs to process and to extend the archive with semantic information about extracted content. To deal with possible changes in the structure of Web applications, our AAH includes an adaptation module that makes crawling resilient to small changes in the structure of a Web site. We show the value of our approach by comparing the output and efficiency of the AAH with those of regular Web crawlers, including in the presence of structural change.
